Big Data Ingestion with Apache Flume

This bundle is a 7-node cluster designed to scale out. Built around Apache
Hadoop components, it contains the following units:

  • 1 HDFS Master
  • 1 HDFS Secondary Namenode
  • 1 YARN Master
  • 3 Compute Slaves
  • 1 Flume-HDFS
  • 1 Plugin (colocated on the Flume unit)

The Flume-HDFS unit provides an Apache Flume agent featuring an Avro source,
memory channel, and HDFS sink. This agent supports relations with the
apache-flume-twitter and apache-flume-syslog charms to ingest Twitter
and remote syslog data, respectively, into HDFS.
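
The charm manages the agent configuration itself, but as a rough sketch, a
Flume agent of this shape is typically described by a properties file along
the following lines (the agent name, port, and HDFS path shown here are purely
illustrative, not the values the charm actually uses):

# declare the source, channel, and sink of an agent called "agent"
agent.sources = avro-source
agent.channels = memory-channel
agent.sinks = hdfs-sink

# Avro source listening for incoming events (port is illustrative)
agent.sources.avro-source.type = avro
agent.sources.avro-source.bind = 0.0.0.0
agent.sources.avro-source.port = 4141
agent.sources.avro-source.channels = memory-channel

# in-memory channel buffering events between source and sink
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000

# HDFS sink writing events into the cluster (path is illustrative)
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://hdfs-master:8020/user/flume/events
agent.sinks.hdfs-sink.channel = memory-channel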

Usage

Deploy this bundle using juju-quickstart:

juju quickstart u/bigdata-dev/apache-ingestion-flume

See juju quickstart --help for deployment options, including machine
constraints and how to deploy a locally modified version of the
apache-ingestion-flume bundle.yaml.
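
For instance, assuming you have a locally modified copy of the bundle file in
the current directory, it can be deployed directly (the path below is
illustrative):

juju quickstart ./bundle.yaml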

Testing the deployment

Smoke test HDFS admin functionality

Once the deployment is complete and the cluster is running, ssh to the HDFS
Master unit:

juju ssh hdfs-master/0

As the ubuntu user, create a temporary directory on the Hadoop file system.
The steps below verify HDFS functionality:

hdfs dfs -mkdir -p /tmp/hdfs-test
hdfs dfs -chmod -R 777 /tmp/hdfs-test
hdfs dfs -ls /tmp # verify the newly created hdfs-test subdirectory exists
hdfs dfs -rm -R /tmp/hdfs-test
hdfs dfs -ls /tmp # verify the hdfs-test subdirectory has been removed
exit
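
Optionally, a cluster-level report can confirm that all three compute slaves
have registered as datanodes. Depending on how HDFS permissions are configured
in this deployment, the command may need to be run as the HDFS superuser
rather than ubuntu:

hdfs dfsadmin -report # lists live datanodes and their capacity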

Smoke test YARN and MapReduce

Run the terasort.sh script from the Flume unit to generate and sort data. The
steps below verify that Flume is communicating with the cluster via the plugin
and that YARN and MapReduce are working as expected:

juju ssh flume-hdfs/0
~/terasort.sh
exit
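
The terasort.sh script is provided by the charm; conceptually it drives the
standard teragen and terasort example jobs, roughly as sketched below. The row
count and input directory here are illustrative; /user/ubuntu/tera_demo_out is
the output directory referenced in the next section:

# locate the MapReduce examples jar shipped with Hadoop (path varies by install)
EXAMPLES_JAR=$(find /usr -name 'hadoop-mapreduce-examples*.jar' 2>/dev/null | head -1)
# generate sample rows, then sort them with a MapReduce job
hadoop jar "$EXAMPLES_JAR" teragen 10000 /user/ubuntu/tera_demo_in
hadoop jar "$EXAMPLES_JAR" terasort /user/ubuntu/tera_demo_in /user/ubuntu/tera_demo_out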

Smoke test HDFS functionality from user space

From the Flume unit, delete the MapReduce output previously generated by the
terasort.sh script:

juju ssh flume-hdfs/0
hdfs dfs -rm -R /user/ubuntu/tera_demo_out
exit

Smoke test Flume

SSH to the Flume unit and verify that the flume-ng Java process is running:

juju ssh flume-hdfs/0
ps -ef | grep flume-ng # verify process is running
exit
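
Once a source charm such as apache-flume-syslog or apache-flume-twitter is
related, ingested events should begin landing in HDFS. The exact destination
directory is determined by the related charm's configuration; the path below
is only an illustration of where to look:

juju ssh flume-hdfs/0
hdfs dfs -ls -R /user/flume # illustrative path; check the related charm for the actual sink directory
exit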

Scale Out Usage

This bundle was designed to scale out. To increase the number of compute
slaves, add units to the compute-slave service. To add one unit:

juju add-unit compute-slave

You can also add multiple units at once; for example, to add four more compute slaves:

juju add-unit -n4 compute-slave
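
Afterwards, juju status can be used to confirm that the new compute-slave
units have come up:

juju status compute-slave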

Contact Information

Help