Description
Collect, aggregate, and move large amounts of data into HDFS.
Overview
Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online
analytic application. Learn more at flume.apache.org.
This charm provides a Flume agent designed to ingest events into the shared
filesystem (HDFS) of a connected Hadoop cluster. It is meant to relate to
other Flume agents such as apache-flume-syslog and apache-flume-twitter.
Usage
This charm uses the hadoop base layer and the hdfs interface to pull in its
dependencies and act as a client to a Hadoop namenode.
You may manually deploy the recommended environment as follows:
juju deploy apache-hadoop-namenode namenode
juju deploy apache-hadoop-resourcemanager resourcemgr
juju deploy apache-hadoop-slave slave
juju deploy apache-hadoop-plugin plugin
juju add-relation namenode slave
juju add-relation resourcemgr slave
juju add-relation resourcemgr namenode
juju add-relation plugin resourcemgr
juju add-relation plugin namenode
Deploy Flume HDFS:
juju deploy apache-flume-hdfs flume-hdfs
juju add-relation flume-hdfs plugin
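Once the relations are in place, you can watch the deployment settle with
juju status (the exact status messages vary by charm and Juju version):
juju status flume-hdfs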
The deployment at this stage isn't very exciting, as the flume-hdfs service
is waiting for other Flume agents to connect and send data. You'll probably
want to check out apache-flume-syslog or apache-flume-twitter to provide
additional functionality for this deployment.
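For example, assuming the apache-flume-syslog charm is available to your
deployment, a syslog agent could be connected along these lines:
juju deploy apache-flume-syslog flume-syslog
juju add-relation flume-syslog flume-hdfs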
When flume-hdfs receives data, it is stored in a /user/flume/<event_dir>
HDFS subdirectory (configured by the connected Flume charm). You can quickly
verify the data written to HDFS using the command line. SSH to the flume-hdfs
unit, locate an event, and cat it:
juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/<event_dir> # <-- find a date
hdfs dfs -ls /user/flume/<event_dir>/<yyyy-mm-dd> # <-- find an event
hdfs dfs -cat /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>
This process works well for data serialized in text format (the default).
For data serialized in avro format, you'll need to copy the file locally
and use the dfs -text command. For example, replace the dfs -cat command
from above with the following to view files stored in avro format:
hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id> /home/ubuntu/myFile.txt
hdfs dfs -text file:///home/ubuntu/myFile.txt
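The serialization format is controlled by the sink_serializer option
described under Configuration below. As a sketch, to switch from the default
text serializer to avro (juju config on Juju 2.x; juju set on 1.x):
juju config flume-hdfs sink_serializer=avro_event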
Contact Information
Help
- Apache Flume home page
- Apache Flume bug tracker
- Apache Flume mailing lists
- #juju on irc.freenode.net
Configuration
- channel_transaction_capacity (string): The maximum number of events the channel will take from a source or give to a sink per transaction. Default: 100
- protocol (string): Ingestion protocol for the agent source. Currently only 'avro' is supported. Default: avro
- resources_mirror (string): URL from which to fetch resources (e.g., Flume binaries) instead of S3.
- roll_count (int): Number of events written to a file before it is rolled. A value of 0 (the default) means never roll based on the number of events.
- roll_size (string): File size that triggers a roll, in bytes. The default rolls the file once it reaches 10 MB. A value of 0 means never roll based on file size. Default: 10000000
- source_port (int): Port on which the agent source listens. Default: 4141
- roll_interval (int): Number of seconds to wait before rolling the current file. The default rolls the file after 5 minutes. A value of 0 means never roll based on a time interval. Default: 300
- sink_serializer (string): Serializer used when the sink writes to HDFS. Either 'avro_event' or 'text' is supported. Default: text
- channel_capacity (string): The maximum number of events stored in the channel. Default: 1000
- sink_compression (string): Compression codec for the agent sink. An empty value will write events to HDFS uncompressed; specify 'snappy' to compress written events using the snappy codec.
- dfs_replication (int): The DFS replication value. The default (3) matches the namenode provided by apache-hadoop-hdfs-master, but may be overridden for this service. Default: 3
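These options can be changed the same way as shown earlier, e.g. to lengthen
the roll interval and enable snappy compression (values are illustrative):
juju config flume-hdfs roll_interval=600 sink_compression=snappy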