Description
Uses a Twitter source, memory channel, and Avro sink in Apache Flume
to ingest Twitter data.
Overview
Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. It has a simple
and flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online
analytic application. Learn more at flume.apache.org.
This charm provides a Flume agent designed to process tweets from the Twitter
Streaming API and send them to the apache-flume-hdfs agent for storage in
the shared filesystem (HDFS) of a connected Hadoop cluster. This leverages the
TwitterSource jar packaged with Flume. Learn more about the
1% firehose.
Prerequisites
The Twitter Streaming API requires developer credentials. You'll need to
configure those for this charm. Find your credentials (or create an account
if needed) here.
Create a secret.yaml file with your Twitter developer credentials:
flume-twitter:
  twitter_access_token: 'YOUR_TOKEN'
  twitter_access_token_secret: 'YOUR_TOKEN_SECRET'
  twitter_consumer_key: 'YOUR_CONSUMER_KEY'
  twitter_consumer_secret: 'YOUR_CONSUMER_SECRET'
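Before deploying, it can help to confirm that the file defines all four credentials and that no placeholder from the template is left in place. The following is a minimal sketch, not part of the charm: the function name check_twitter_secrets is hypothetical, and the check only greps for the key names and the YOUR_ placeholder prefix. It does not parse YAML or verify the credentials with Twitter.

```shell
# Hypothetical helper (not shipped with the charm): sanity-check secret.yaml.
check_twitter_secrets() {
    file="${1:-secret.yaml}"
    status=0
    for key in twitter_access_token twitter_access_token_secret \
               twitter_consumer_key twitter_consumer_secret; do
        # The key must start a line (after optional indentation) and be
        # followed by some non-blank value.
        if ! grep -Eq "^[[:space:]]*${key}:[[:space:]]*[^[:space:]]" "$file"; then
            echo "missing or empty: ${key}" >&2
            status=1
        fi
    done
    # The template placeholders all begin with YOUR_.
    if grep -q "YOUR_" "$file"; then
        echo "placeholder value still present in ${file}" >&2
        status=1
    fi
    return "$status"
}
```

Running check_twitter_secrets secret.yaml returns 0 when all four keys have values and no placeholder remains, and a non-zero status otherwise.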
Usage
This charm leverages our pluggable Hadoop model with the hadoop-plugin
interface. This means that you will need to deploy a base Apache Hadoop cluster
to run Flume. The suggested deployment method is to use the
apache-ingestion-flume
bundle. This will deploy the Apache Hadoop platform with a single Apache Flume
unit that communicates with the cluster by relating to the
apache-hadoop-plugin subordinate charm:
juju quickstart u/bigdata-dev/apache-ingestion-flume
Alternatively, you may manually deploy the recommended environment as follows:
juju deploy apache-hadoop-hdfs-master hdfs-master
juju deploy apache-hadoop-yarn-master yarn-master
juju deploy apache-hadoop-compute-slave compute-slave
juju deploy apache-hadoop-plugin plugin
juju deploy apache-flume-hdfs flume-hdfs
juju add-relation yarn-master hdfs-master
juju add-relation compute-slave yarn-master
juju add-relation compute-slave hdfs-master
juju add-relation plugin yarn-master
juju add-relation plugin hdfs-master
juju add-relation flume-hdfs plugin
Once the bundle (or the manually deployed environment) is up, add the
apache-flume-twitter charm and relate it to the flume-hdfs agent:
juju deploy apache-flume-twitter flume-twitter --config=secret.yaml
juju add-relation flume-twitter flume-hdfs
That's it! Once the Flume agents start, tweets will start flowing into
HDFS in year-month-day/hour directories here: /user/flume/events/%y-%m-%d/%H.
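The directory pattern above can be expanded for a concrete time with date, which makes it easier to jump straight to the current hour. This is a sketch under two assumptions: that the %y-%m-%d/%H escapes behave like the same strftime fields, and that the Flume agent and your shell use the same time zone. The function name flume_event_dir is hypothetical, and the optional epoch argument relies on GNU date.

```shell
# Hypothetical helper: print the events directory for a given time.
# With no argument it uses the current time; with an argument it expects
# seconds since the epoch (requires GNU date for "-d @EPOCH").
flume_event_dir() {
    base="/user/flume/events"
    if [ -n "$1" ]; then
        echo "${base}/$(date -d "@$1" +%y-%m-%d/%H)"
    else
        echo "${base}/$(date +%y-%m-%d/%H)"
    fi
}
```

On the flume-hdfs unit, hdfs dfs -ls "$(flume_event_dir)" would then list the events written during the current hour, if any exist yet.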
Test the deployment
To verify this charm is working as intended, SSH to the flume-hdfs unit,
locate an event, and cat it:
juju ssh flume-hdfs/0
hdfs dfs -ls /user/flume/events # <-- find a date
hdfs dfs -ls /user/flume/events/yy-mm-dd # <-- find an hour
hdfs dfs -ls /user/flume/events/yy-mm-dd/HH # <-- find an event
hdfs dfs -cat /user/flume/events/yy-mm-dd/HH/FlumeData.[id].avro
You'll see Avro headers, since Avro is the default format used to store the
tweets. You may not recognize the body of a tweet if it's not in a
language you understand (remember, this is a 1% firehose of tweets from all over
the world). You may have to cat a few different events before you find a
tweet worth reading. Happy hunting!
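The date, hour, and event lookups above can be chained so that the newest event is found without reading each listing by hand. The sketch below assumes that each entry line of hdfs dfs -ls ends with the path and that entries are listed in name order, so the last entry is the most recent; the function names are hypothetical. A file that Flume is still writing may be picked up as the newest entry.

```shell
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print the
# path of the last entry. Summary lines such as "Found N items" have fewer
# than 8 fields and are skipped.
last_hdfs_entry() {
    awk 'NF >= 8 { path = $NF } END { if (path != "") print path }'
}

# Hypothetical helper: walk date -> hour -> event file and cat the result.
# Intended to be run on the flume-hdfs unit, where the hdfs command exists.
cat_latest_event() {
    day=$(hdfs dfs -ls /user/flume/events | last_hdfs_entry)
    hour=$(hdfs dfs -ls "$day" | last_hdfs_entry)
    file=$(hdfs dfs -ls "$hour" | last_hdfs_entry)
    hdfs dfs -cat "$file"
}
```

After juju ssh flume-hdfs/0, defining both functions and calling cat_latest_event prints the raw Avro content of the most recently listed event.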
Contact Information
Help
- Apache Flume home page
- Apache Flume bug tracker
- Apache Flume mailing lists
- #juju on irc.freenode.net
Configuration
- channel_transaction_capacity
- (string) The maximum number of events the channel will take from a source or give to a sink per transaction.
- 100
- twitter_max_batch_duration
- (int) Maximum number of milliseconds to wait before closing a batch
- 1000
- event_dir
- (string) The HDFS subdirectory under /user/flume where events will be stored.
- flume-twitter
- twitter_access_token
- (string) OAuth Access token from your Twitter developer account
- twitter_consumer_key
- (string) OAuth Consumer key from your Twitter developer account
- resources_mirror
- (string) URL from which to fetch resources (e.g., Hadoop binaries) instead of Launchpad.
- twitter_consumer_secret
- (string) OAuth Consumer secret from your Twitter developer account
- twitter_source
- (string) The application to use for this Flume source. Defaults to the TwitterSource bundled with Flume.
- org.apache.flume.source.twitter.TwitterSource
- channel_capacity
- (string) The maximum number of events stored in the channel.
- 1000
- twitter_max_batch_size
- (int) Maximum number of Twitter messages to put in a single batch
- 1000
- twitter_access_token_secret
- (string) OAuth Access token secret from your Twitter developer account
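When tuning channel_capacity and channel_transaction_capacity together, a quick check before applying the values can catch an inconsistent pair. The sketch below rests on the assumption that Flume's memory channel does not accept a transaction capacity larger than the channel capacity; the function name check_channel_sizes is hypothetical and not part of the charm.

```shell
# Hypothetical helper: validate a channel_capacity / channel_transaction_capacity pair.
# Returns 0 if the pair looks consistent, 1 if the transaction capacity is
# larger than the channel capacity, 2 if either value is not a whole number.
check_channel_sizes() {
    capacity="$1"
    tx_capacity="$2"
    for value in "$capacity" "$tx_capacity"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "not a non-negative integer: '${value}'" >&2
                return 2
                ;;
        esac
    done
    if [ "$tx_capacity" -gt "$capacity" ]; then
        echo "channel_transaction_capacity (${tx_capacity}) exceeds channel_capacity (${capacity})" >&2
        return 1
    fi
    return 0
}
```

With the defaults listed above, check_channel_sizes 1000 100 succeeds, while swapping the two values makes it fail.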