Collect, aggregate, and move large amounts of data into HDFS.
Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online
analytic application. Learn more at flume.apache.org.
This charm provides a Flume agent designed to ingest events into the shared
filesystem (HDFS) of a connected Hadoop cluster. It is meant to relate to
other Flume agents that collect data and send it to this agent for storage in HDFS.
This charm uses the hadoop base layer and the hdfs interface to pull in its
dependencies and act as a client to a Hadoop namenode.
You may manually deploy the recommended environment as follows:

    juju deploy apache-hadoop-namenode namenode
    juju deploy apache-hadoop-resourcemanager resourcemgr
    juju deploy apache-hadoop-slave slave
    juju deploy apache-hadoop-plugin plugin
    juju add-relation namenode slave
    juju add-relation resourcemgr slave
    juju add-relation resourcemgr namenode
    juju add-relation plugin resourcemgr
    juju add-relation plugin namenode

Deploy Flume HDFS:

    juju deploy apache-flume-hdfs flume-hdfs
    juju add-relation flume-hdfs plugin

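The deployment may take several minutes while packages install and relations
settle. You can monitor progress with juju status and wait for the units to
finish before continuing:

    # show unit, relation, and agent status; re-run until everything settles
    juju status
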
The deployment at this stage isn't very exciting, as the flume-hdfs agent
is waiting for other Flume agents to connect and send data. You'll probably
want to check out other Flume source charms to provide additional
functionality for this deployment.
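For illustration, relating a source agent to flume-hdfs looks like the
following. The charm name apache-flume-syslog is only an example and may
differ in your charm store; substitute whichever Flume source charm you
deploy:

    # deploy a Flume source agent (charm name is illustrative) and relate it
    juju deploy apache-flume-syslog flume-syslog
    juju add-relation flume-syslog flume-hdfs

Once related, events collected by the source agent are forwarded to
flume-hdfs and written into HDFS as described below.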
When flume-hdfs receives data, it is stored in an
HDFS subdirectory (configured by the connected Flume charm). You can quickly
verify the data written to HDFS using the command line. SSH to the flume-hdfs
unit, locate an event, and cat it:

    juju ssh flume-hdfs/0
    hdfs dfs -ls /user/flume/<event_dir>                # <-- find a date
    hdfs dfs -ls /user/flume/<event_dir>/<yyyy-mm-dd>   # <-- find an event
    hdfs dfs -cat /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>

This process works well for data serialized in text format (the default).
For data serialized in avro format, you'll need to copy the file locally
and use the hdfs dfs -text command. For example, replace the hdfs dfs -cat
command from above with the following to view files stored in avro format:

    hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id> /home/ubuntu/myFile.txt
    hdfs dfs -text file:///home/ubuntu/myFile.txt

- (string) The maximum number of events the channel will take from a source or give to a sink per transaction.
- (string) Ingestion protocol for the agent source. Currently only 'avro' is supported.
- (string) URL from which to fetch resources (e.g., Flume binaries) instead of S3
- (int) Number of events written to file before it is rolled. A value of 0 (the default) means never roll based on number of events.
- (string) File size to trigger roll, in bytes. Default will roll the file once it reaches 10 MB. A value of 0 means never roll based on file size.
- (int) Port on which the agent source is listening.
- (int) Number of seconds to wait before rolling the current file. Default will roll the file after 5 minutes. A value of 0 means never roll based on a time interval.
- (string) Specify the serializer used when the sink writes to HDFS. Either 'avro_event' or 'text' are supported.
- (string) The maximum number of events stored in the channel.
- (string) Compression codec for the agent sink. An empty value will write events to HDFS uncompressed. You may specify 'snappy' here to compress written events using the snappy codec.
- (int) The DFS replication value. The default (3) is the same as the Namenode provided by apache-hadoop-hdfs-master, but may be overridden for this service.
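These options can be set at deploy time or changed later with Juju's
configuration commands. The following is a minimal sketch assuming Juju 2.x
(older Juju versions use juju set / juju get instead) and assuming the
replication option above is named dfs_replication; verify the actual option
names in the charm's config.yaml before running:

    # list the current configuration of the flume-hdfs application
    juju config flume-hdfs
    # example only: the option name dfs_replication is assumed; verify it first
    juju config flume-hdfs dfs_replication=1
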