apache flume rabbitmq #7

  • By bigdata-dev
  • Latest version (#7)
  • Series: trusty

Description

Uses a RabbitMQ source, memory channel, and Avro sink in Apache Flume
to ingest messages published to a RabbitMQ queue.


Overview

Flume is a distributed, reliable, and highly-available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability, failover, and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic application. Learn
more at flume.apache.org.

This charm provides a Flume agent designed to ingest messages published to
a RabbitMQ queue and send them to the apache-flume-hdfs agent for storage in
the shared filesystem (HDFS) of a connected Hadoop cluster. This utilizes a
RabbitMQ-Flume Plugin.
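
The charm renders the agent configuration itself, but as a rough sketch of
the source/channel/sink pipeline described above, a hand-written equivalent
might look like the following. The source class name, property keys, and
Avro port here are assumptions based on common Flume and RabbitMQ-Flume
plugin conventions, not the charm's actual output:

cat <<'EOF' > /tmp/flume-rabbitmq-example.conf
# Illustrative only -- the charm generates its own configuration.
a1.sources  = rmq
a1.channels = mem
a1.sinks    = avro

# RabbitMQ source (class name is an assumption from the plugin)
a1.sources.rmq.type      = org.apache.flume.source.rabbitmq.RabbitMQSource
a1.sources.rmq.hostname  = localhost
a1.sources.rmq.queuename = logs
a1.sources.rmq.channels  = mem

# In-memory channel; capacities mirror the charm's config options
a1.channels.mem.type                = memory
a1.channels.mem.capacity            = 1000
a1.channels.mem.transactionCapacity = 100

# Avro sink pointing at the flume-hdfs agent
a1.sinks.avro.type     = avro
a1.sinks.avro.hostname = <flume-hdfs unit IP>
a1.sinks.avro.port     = 4141
a1.sinks.avro.channel  = mem
EOF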

Deployment

This charm leverages our pluggable Hadoop model with the hadoop-plugin
interface. A base Apache Hadoop cluster is required. The suggested
deployment method is to use the apache-ingestion-flume-rabbitmq bundle.

Bundle Deployment

This will deploy the Apache Hadoop platform with a pair of Apache Flume
agents that facilitate communication between RabbitMQ and HDFS:

juju quickstart u/bigdata-dev/apache-ingestion-flume-rabbitmq
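
Deployment can take a while as units come up. One way to watch progress
(assuming the Juju 1.x CLI used throughout this README) is:

juju status --format tabular   # wait for all units to reach "started"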

Manual Deployment

You may manually deploy the recommended environment as follows:

juju deploy apache-hadoop-hdfs-master hdfs-master
juju deploy apache-hadoop-yarn-master yarn-master
juju deploy apache-hadoop-compute-slave compute-slave
juju deploy apache-hadoop-plugin plugin
juju deploy apache-flume-hdfs flume-hdfs
juju deploy rabbitmq-server rabbitmq

juju add-relation yarn-master hdfs-master
juju add-relation compute-slave yarn-master
juju add-relation compute-slave hdfs-master
juju add-relation plugin yarn-master
juju add-relation plugin hdfs-master
juju add-relation flume-hdfs plugin

Continue manual deployment by colocating the flume-rabbitmq charm on the
rabbitmq unit:

RABBIT_MACHINE_ID=$(juju status rabbitmq --format tabular | grep "rabbitmq/" | awk '{ print $5 }')
juju deploy --to ${RABBIT_MACHINE_ID} apache-flume-rabbitmq flume-rabbitmq
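
To confirm that both units landed on the same machine, compare the machine
column for the two units in the status output:

juju status --format tabular | grep -E "rabbitmq/|flume-rabbitmq/"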

Finally, complete manual deployment by relating the flume-rabbitmq charm to
both flume-hdfs and rabbitmq:

juju add-relation flume-rabbitmq rabbitmq
juju add-relation flume-rabbitmq flume-hdfs
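
Once these relations are made, the Flume agent should start on the
colocated unit. A quick, generic sanity check is to look for the Flume
JVM process:

juju ssh flume-rabbitmq/0 'ps -ef | grep [f]lume'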

Usage

When flume-hdfs receives data, it is stored in a /user/flume/<event_dir>
HDFS subdirectory (configured by the connected Flume charm). The <event_dir>
subdirectory is set to flume-rabbitmq by default for this charm. You can
quickly verify the data written to HDFS using the command line. SSH to the
flume-hdfs unit, locate an event, and cat it:

juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/flume-rabbitmq               # <-- find a date
hdfs dfs -ls /user/flume/flume-rabbitmq/<yyyy-mm-dd>  # <-- find an event
hdfs dfs -cat /user/flume/flume-rabbitmq/<yyyy-mm-dd>/FlumeData.<id>

This process works well for data serialized in text format (the default).
For data serialized in avro format, you'll need to copy the file locally
and use the dfs -text command. For example, replace the dfs -cat command
from above with the following to view files stored in avro format:

hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id> /home/ubuntu/myFile.txt
hdfs dfs -text file:///home/ubuntu/myFile.txt

Configure the environment

The RabbitMQ queue and virtualhost where messages are published are unset
by default. Set these to an existing RabbitMQ queue and virtualhost as
follows:

juju set flume-rabbitmq rabbitmq_queuename='<queue_name>' rabbitmq_virtualhost='<vhost_name>'

If you have changed the access credentials on your RabbitMQ server (e.g.,
for the management GUI), you can also specify them:

juju set flume-rabbitmq rabbitmq_username='<user_name>' rabbitmq_password='<user_password>'
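
You can review the current settings at any time with:

juju get flume-rabbitmq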

Test the deployment

Generate RabbitMQ messages on the flume-rabbitmq unit with the producer script:

juju set flume-rabbitmq rabbitmq_queuename='logs'
juju ssh flume-rabbitmq/0
cd /var/lib/juju/agents/unit-flume-rabbitmq-0/charm/scripts
while read line; do ./t1/send_log.py info "$line"; done < /var/log/syslog

Note that if you did not colocate your Flume agent with RabbitMQ, you'll
need to update this script with the private IP address of the RabbitMQ
server.
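
If the producer script is unavailable, and assuming the RabbitMQ
management plugin (and its rabbitmqadmin tool) is enabled on the rabbitmq
unit, you can publish a test message directly to the queue via the default
exchange (the routing key matches the 'logs' queue name set above):

rabbitmqadmin publish exchange=amq.default routing_key=logs payload="test message"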

To verify these messages are being stored into HDFS, SSH to the flume-hdfs
unit, locate an event, and cat it:

juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/flume-rabbitmq               # <-- find a date
hdfs dfs -ls /user/flume/flume-rabbitmq/<yyyy-mm-dd>  # <-- find an event
hdfs dfs -cat /user/flume/flume-rabbitmq/<yyyy-mm-dd>/FlumeData.<id>
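
For a quick aggregate check instead of inspecting individual events, the
standard HDFS count utility reports directory, file, and byte totals:

hdfs dfs -count /user/flume/flume-rabbitmq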

Configuration

channel_capacity
  (string) The maximum number of events stored in the channel.
  Default: 1000

channel_transaction_capacity
  (string) The maximum number of events the channel will take from a
  source or give to a sink per transaction.
  Default: 100

event_dir
  (string) The HDFS subdirectory under /user/flume where events will be
  stored.
  Default: flume-rabbitmq

rabbitmq_exchangename
  (string) RabbitMQ exchange for the source (could be passed by relation
  over time).
  Default: (empty)

rabbitmq_password
  (string) RabbitMQ password to connect to the queue.
  Default: guest

rabbitmq_queuename
  (string) Queue to connect to on the RabbitMQ server.
  Default: rabbitmq

rabbitmq_username
  (string) RabbitMQ user for the source (could be passed by relation over
  time).
  Default: guest

rabbitmq_virtualhost
  (string) RabbitMQ virtualhost to connect to the queue.
  Default: (unset)

resources_mirror
  (string) URL from which to fetch resources (e.g., Hadoop binaries)
  instead of Launchpad.
  Default: (unset)