Uses a Twitter source, memory channel, and Avro sink in Apache Flume
to ingest Twitter data.

Overview

Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple, extensible data model that allows for online
analytic applications. Learn more at flume.apache.org.

This charm provides a Flume agent designed to process tweets from the Twitter
Streaming API and send them to the apache-flume-hdfs agent for storage in
the shared filesystem (HDFS) of a connected Hadoop cluster. It leverages the
TwitterSource class bundled with Flume, which consumes the 1% sample
firehose provided by the Streaming API.
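
For reference, the moving parts correspond to a Flume agent definition along
these lines. This is an illustrative sketch built from Flume's standard
TwitterSource, memory channel, and Avro sink property names and this charm's
default values; the agent and component names are assumptions, not the exact
template the charm renders:

agent.sources = twitter
agent.channels = memory-channel
agent.sinks = avro-sink

agent.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
agent.sources.twitter.consumerKey = YOUR_CONSUMER_KEY
agent.sources.twitter.consumerSecret = YOUR_CONSUMER_SECRET
agent.sources.twitter.accessToken = YOUR_TOKEN
agent.sources.twitter.accessTokenSecret = YOUR_TOKEN_SECRET
agent.sources.twitter.maxBatchSize = 1000
agent.sources.twitter.maxBatchDurationMillis = 1000
agent.sources.twitter.channels = memory-channel

agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000
agent.channels.memory-channel.transactionCapacity = 100

agent.sinks.avro-sink.type = avro
agent.sinks.avro-sink.hostname = <flume-hdfs unit address>
agent.sinks.avro-sink.port = <avro port provided by the relation>
agent.sinks.avro-sink.channel = memory-channel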

Prerequisites

The Twitter Streaming API requires developer credentials, which you'll need
to configure for this charm. Find your credentials (or create an account if
needed) on the Twitter developer site.

Create a secret.yaml file with your Twitter developer credentials:

flume-twitter:
    twitter_access_token: 'YOUR_TOKEN'
    twitter_access_token_secret: 'YOUR_TOKEN_SECRET'
    twitter_consumer_key: 'YOUR_CONSUMER_KEY'
    twitter_consumer_secret: 'YOUR_CONSUMER_SECRET'
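
If your credentials change later, you should be able to update them on the
deployed service without redeploying, using juju set (juju 1.x syntax):

juju set flume-twitter twitter_access_token='NEW_TOKEN' twitter_access_token_secret='NEW_TOKEN_SECRET'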

Usage

This charm leverages our pluggable Hadoop model with the hadoop-plugin
interface. This means that you will need to deploy a base Apache Hadoop cluster
to run Flume. The suggested deployment method is to use the
apache-ingestion-flume
bundle. This will deploy the Apache Hadoop platform with a single Apache Flume
unit that communicates with the cluster by relating to the
apache-hadoop-plugin subordinate charm:

juju quickstart u/bigdata-dev/apache-ingestion-flume

Alternatively, you may manually deploy the recommended environment as follows:

juju deploy apache-hadoop-hdfs-master hdfs-master
juju deploy apache-hadoop-yarn-master yarn-master
juju deploy apache-hadoop-compute-slave compute-slave
juju deploy apache-hadoop-plugin plugin
juju deploy apache-flume-hdfs flume-hdfs

juju add-relation yarn-master hdfs-master
juju add-relation compute-slave yarn-master
juju add-relation compute-slave hdfs-master
juju add-relation plugin yarn-master
juju add-relation plugin hdfs-master
juju add-relation flume-hdfs plugin
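
Deployment and relation setup take a few minutes; once everything settles,
juju status should show each unit as started:

juju status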

Now that the base environment has been deployed (either via quickstart or
manually), you are ready to add the apache-flume-twitter charm and
relate it to the flume-hdfs agent:

juju deploy apache-flume-twitter flume-twitter --config=secret.yaml
juju add-relation flume-twitter flume-hdfs

That's it! Once the Flume agents start, tweets will start flowing into
HDFS via the flume-twitter and flume-hdfs charms. Flume may include
multiple events in each file written to HDFS. This is configurable with various
options on the flume-hdfs charm. See descriptions of the roll_* options on
the apache-flume-hdfs charm store
page for more details.
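
For example, assuming the flume-hdfs charm exposes the HDFS sink's roll
settings under names like roll_count and roll_size (hypothetical names here;
check the charm page for the real ones), you could make Flume roll files by
size alone:

juju set flume-hdfs roll_count=0 roll_size=67108864  # hypothetical: disable count-based rolls, roll at 64 MB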

Flume will write files to HDFS in the following location:
/user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>. The <event_dir>
subdirectory is configurable and set to flume-twitter by default for this
charm.
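
For example, to land events under /user/flume/tweets instead, set event_dir
at deploy time or on the running service:

juju set flume-twitter event_dir=tweets
hdfs dfs -ls /user/flume/tweets  # run on the flume-hdfs unit once new events arrive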

Test the deployment

To verify this charm is working as intended, SSH to the flume-hdfs unit and
locate an event:

juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/<event_dir>               # <-- find a date
hdfs dfs -ls /user/flume/<event_dir>/<yyyy-mm-dd>  # <-- find an event

Since our tweets are serialized in Avro format, you'll need to copy the file
locally and use the hdfs dfs -text command to view it:

hdfs dfs -copyToLocal /user/flume/<event_dir>/<yyyy-mm-dd>/FlumeData.<id>.avro /home/ubuntu/myFile.txt
hdfs dfs -text file:///home/ubuntu/myFile.txt
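
Alternatively, if you have the Avro tools jar on hand (it is not installed by
this charm), its tojson command dumps each event as JSON:

java -jar avro-tools.jar tojson /home/ubuntu/myFile.txt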

You may not recognize the body of the tweet if it's not in a language you
understand (remember, this is a 1% firehose from tweets all over the world).
You may have to try a few different events before you find a tweet worth
reading. Happy hunting!

Configuration

channel_capacity
    (string) The maximum number of events stored in the channel.
    Default: 1000

channel_transaction_capacity
    (string) The maximum number of events the channel will take from a
    source or give to a sink per transaction.
    Default: 100

event_dir
    (string) The HDFS subdirectory under /user/flume where events will be
    stored.
    Default: flume-twitter

resources_mirror
    (string) URL from which to fetch resources (e.g., Hadoop binaries)
    instead of Launchpad.

twitter_access_token
    (string) OAuth access token from your Twitter developer account.

twitter_access_token_secret
    (string) OAuth access token secret from your Twitter developer account.

twitter_consumer_key
    (string) OAuth consumer key from your Twitter developer account.

twitter_consumer_secret
    (string) OAuth consumer secret from your Twitter developer account.

twitter_max_batch_duration
    (int) Maximum number of milliseconds to wait before closing a batch.
    Default: 1000

twitter_max_batch_size
    (int) Maximum number of Twitter messages to put in a single batch.
    Default: 1000

twitter_source
    (string) The application to use for this Flume source. Defaults to the
    TwitterSource class bundled with Flume.
    Default: org.apache.flume.source.twitter.TwitterSource
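
Most of these can be tuned on a running deployment with juju set; for
example, to enlarge the memory channel and shorten batches:

juju set flume-twitter channel_capacity=10000 twitter_max_batch_size=500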