Hadoop is a software platform that lets one easily write and
run applications that process vast amounts of data.

Overview

What is Apache Hadoop?
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using a simple programming model.

It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage. Rather than rely on hardware
to deliver high availability, the library itself is designed to detect
and handle failures at the application layer, delivering a
highly-available service on top of a cluster of computers, each of
which may be prone to failure.

Apache Hadoop 2.2.0 includes significant improvements over the previous stable release (hadoop-1.x).

Here is a short overview of the improvements to both HDFS and MapReduce.

  • HDFS Federation
    In order to scale the name service horizontally, federation uses multiple independent
    Namenodes/Namespaces. The Namenodes are federated, that is, they are independent
    and do not require coordination with each other. The datanodes are used as common
    storage for blocks by all the Namenodes. Each datanode registers with all the
    Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and
    handle commands from the Namenodes.

More details are available in the HDFS Federation document:
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/Federation.html

  • MapReduce NextGen aka YARN aka MRv2
    The new architecture, introduced in hadoop-0.23, divides the two major functions of the
    JobTracker, resource management and job life-cycle management, into separate components.
    The new ResourceManager manages the global assignment of compute resources to
    applications, and the per-application ApplicationMaster manages the application's
    scheduling and coordination.
    An application is either a single job in the sense of classic MapReduce jobs or a DAG of
    such jobs.

The ResourceManager and per-machine NodeManager daemon, which manages the user processes on
that machine, form the computation fabric.

The per-application ApplicationMaster is, in effect, a framework-specific library and is
tasked with negotiating resources from the ResourceManager and working with the
NodeManager(s) to execute and monitor the tasks.

More details are available in the YARN document:
http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html

Usage

This charm supports the following Hadoop roles:

  • HDFS: namenode, secondarynamenode and datanode (HDFS Federation TBD)
  • YARN: ResourceManager, NodeManager

This allows Hadoop to be deployed in a number of configurations.

Simple Usage: Combined HDFS and MapReduce

In this configuration, the YARN ResourceManager is deployed on the same
service unit as the HDFS namenode, and the HDFS datanodes also run the YARN NodeManager::

juju deploy hadoop hadoop-master
juju deploy hadoop hadoop-slavecluster
juju add-unit -n 2 hadoop-slavecluster
juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
juju add-relation hadoop-master:resourcemanager hadoop-slavecluster:nodemanager
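
Once the relations have settled, the master unit should be running the HDFS NameNode
and YARN ResourceManager daemons, and each slave unit a DataNode and NodeManager. A
quick way to verify this (a sketch, assuming the service names above and that the
JDK's jps tool is on the path of each unit)::

juju ssh hadoop-master/0 jps
juju ssh hadoop-slavecluster/0 jps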

Scale Out Usage: Separate HDFS and MapReduce

In this configuration the HDFS and YARN deployments operate on
different service units as separate services::

juju deploy hadoop hdfs-namenode
juju deploy hadoop hdfs-datacluster
juju add-unit -n 2 hdfs-datacluster
juju add-relation hdfs-namenode:namenode hdfs-datacluster:datanode

juju deploy hadoop mapred-resourcemanager
juju deploy hadoop mapred-taskcluster
juju add-unit -n 2 mapred-taskcluster
juju add-relation mapred-resourcemanager:mapred-namenode hdfs-namenode:namenode
juju add-relation mapred-taskcluster:mapred-namenode hdfs-namenode:namenode    
juju add-relation mapred-resourcemanager:resourcemanager mapred-taskcluster:nodemanager

In the long term, juju should support improved placement of services to
better support this type of deployment. This would allow MapReduce services
to be deployed onto machines with more processing power and HDFS services
to be deployed onto machines with larger storage.

To deploy a Hadoop service with the ElasticSearch service::

# deploy ElasticSearch locally:
juju deploy elasticsearch elasticsearch
# the elasticsearch-hadoop.jar file will be added to the LIBJARS path
# it is recommended to use the hadoop -libjars option to include the elasticsearch-hadoop jar
juju add-unit elasticsearch
# deploy the hadoop service using any of the scenarios mentioned above
# associate Hadoop with ElasticSearch
juju add-relation hadoop-master:elasticsearch elasticsearch:client
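
Jobs that read from or write to ElasticSearch need the elasticsearch-hadoop connector
jar on their classpath. A minimal sketch of submitting such a job with the generic
-libjars option follows; the jar location, job jar and class name are illustrative
placeholders, and the job must use ToolRunner/GenericOptionsParser for -libjars to be
honoured::

# paths and class names below are placeholders for your own job
export LIBJARS=/usr/local/hadoop/lib/elasticsearch-hadoop.jar
hadoop jar my-es-job.jar com.example.MyEsJob -libjars ${LIBJARS} /input /output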

Known Limitations and Issues

Note that removing the relation between namenode and datanode is destructive!
The role of the service is determined at the point that the relation is added
(it must be qualified) and CANNOT be changed later!

A single hdfs-namenode service can support multiple slave service deployments::

juju deploy hadoop hdfs-datacluster-02
juju add-unit -n 2 hdfs-datacluster-02
juju add-relation hdfs-namenode:namenode hdfs-datacluster-02:datanode

Configuration

dfs_namenode_handler_count:
The number of server threads for the namenode. Increase this in larger
deployments to ensure the namenode can cope with the number of datanodes
that it has to deal with.
dfs_replication:
Default block replication. The actual number of replicas can be specified when
a file is created. This default is used if replication is not specified at create time.
dfs_block_size:
The default block size for new files (the Hadoop 1.x default is 64MB; this charm
defaults to 128MB). Increase this in larger deployments for better large data set
performance.
io_file_buffer_size:
The size of the buffer for use in sequence files. The size of this buffer should
probably be a multiple of the hardware page size (4096 on Intel x86), and it
determines how much data is buffered during read and write operations.
dfs_datanode_max_xcievers:
The number of files that a datanode will serve at any one time.
A Hadoop HDFS datanode has an upper bound on the number of files that it
will serve at any one time. This defaults to 256 (which is low) in hadoop
1.x; this charm increases it to 4096.
mapreduce_framework_name:
Execution framework set to Hadoop YARN.
mapreduce_reduce_shuffle_parallelcopies:
The default number of parallel transfers run by reduce during the
copy (shuffle) phase.
mapred_child_java_opts:
Java opts for the task tracker child processes. The following symbol,
if present, will be interpolated: @taskid@ is replaced by current TaskID.
Any other occurrences of '@' will go unchanged. For example, to enable
verbose gc logging to a file named for the taskid in /tmp and to set
the heap maximum to be a gigabyte, pass a 'value' of:
.
-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
.
The configuration variable mapred.child.ulimit can be used to control
the maximum virtual memory of the child processes.
mapreduce_task_io_sort_factor:
The number of streams to merge at once while sorting files. This
determines the number of open file handles.
mapreduce_task_io_sort_mb:
The memory limit, in MB, to use while sorting data; a higher limit improves efficiency.
mapred_job_tracker_handler_count:
The number of server threads for the JobTracker. This should be roughly
4% of the number of tasktracker nodes.
tasktracker_http_threads:
The number of worker threads for the HTTP server. This is used for
map output fetching.
hadoop_dir_base:
The directory under which all other hadoop data is stored. Use this
to take advantage of extra storage that might be available.
.
You can change this in a running deployment but all existing data in
HDFS will be inaccessible; you can of course switch it back if you
do this by mistake.
yarn_nodemanager_aux-services:
The auxiliary shuffle service that must be configured for MapReduce applications.
yarn_nodemanager_aux-services_mapreduce_shuffle_class:
The class implementing the shuffle auxiliary service required by MapReduce applications.
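
Any of these options can be changed on a running deployment with juju's set command;
a sketch, assuming the service names used in the examples above::

juju set hadoop-master dfs_replication=3
juju set hadoop-master dfs_namenode_handler_count=20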

Contact Information

Amir Sanjar <amir.sanjar@canonical.com>

Hadoop

Configuration

dfs_datanode_max_xcievers (int, default: 4096):
The number of files that a datanode will serve at any one time.
A Hadoop HDFS datanode has an upper bound on the number of files that it
will serve at any one time. This defaults to 256 (which is low) in hadoop
1.x; this charm increases it to 4096.
pig (boolean):
To install Apache Pig on all service units alongside Hadoop, set this
configuration to 'True'.
.
Apache Pig is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs. The salient property
of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data
sets.
mapred_job_tracker_handler_count (int, default: 10):
The number of server threads for the JobTracker. This should be roughly
4% of the number of tasktracker nodes.
mapreduce_framework_name (string, default: yarn):
Execution framework, set to Hadoop YARN. ** DO NOT CHANGE **
hbase (boolean):
Setting this configuration parameter to 'True' configures HDFS for
use with HBase, including turning on 'append' mode, which is not
desirable in all deployment scenarios.
tasktracker_http_threads (int, default: 40):
The number of worker threads for the HTTP server. This is used for
map output fetching.
mapreduce_reduce_shuffle_parallelcopies (int, default: 5):
The default number of parallel transfers run by reduce during the
copy (shuffle) phase.
source (string, default: stable):
Location and packages to install hadoop:
.
 * dev:     Install using the hadoop packages from
            ppa:hadoop-ubuntu/dev.
 * testing: Install using the hadoop packages from
            ppa:hadoop-ubuntu/testing.
 * stable:  Install using the hadoop packages from
            ppa:hadoop-ubuntu/stable.
.
The packages provided in the hadoop-ubuntu team PPAs are based
directly on upstream hadoop releases but are not fully built from
source.
hadoop_dir_base (string, default: /usr/local/hadoop/data):
The directory under which all other hadoop data is stored. Use this
to take advantage of extra storage that might be available.
.
You can change this in a running deployment, but all existing data in
HDFS will be inaccessible; you can of course switch it back if you
do this by mistake.
mapred_child_java_opts (string, default: -Xmx200m):
Java opts for the task tracker child processes. The following symbol,
if present, will be interpolated: @taskid@ is replaced by the current TaskID.
Any other occurrences of '@' will go unchanged. For example, to enable
verbose gc logging to a file named for the taskid in /tmp and to set
the heap maximum to be a gigabyte, pass a 'value' of:
.
  -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
.
The configuration variable mapred.child.ulimit can be used to control
the maximum virtual memory of the child processes.
dfs_replication (int, default: 1):
Default block replication. The actual number of replicas can be specified when
a file is created. This default is used if replication is not specified at create time.
yarn_nodemanager_aux-services_mapreduce_shuffle_class (string, default: org.apache.hadoop.mapred.ShuffleHandler):
The class implementing the shuffle auxiliary service required by MapReduce applications.
webhdfs (boolean):
Hadoop includes a RESTful API over HTTP to HDFS. Setting this flag
to True enables this part of the HDFS service.
heap (int, default: 1024):
The maximum heap size in MB to allocate for daemon processes within the
service units managed by this charm.
.
The recommended configurations vary based on role and the amount of
raw disk storage available in the hadoop cluster:
.
 * NameNode: 1GB of heap for every 100TB of raw data stored.
 * SecondaryNameNode: Must be paired with the NameNode.
 * JobTracker: 2GB.
 * DataNode: 1GB.
 * TaskTracker: 1GB.
.
The above recommendations are taken from HBase: The Definitive Guide by
Lars George.
.
Obviously you need to ensure that the servers supporting each service unit
have sufficient memory to accommodate this setting; it should be no more
than 75% of the total memory in the system, excluding swap.
.
If you are also mixing MapReduce and DFS roles on the same units you need to
take this into account as well (see README for more details).
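
As a rough worked instance of the NameNode guideline above (a sketch using the example
service names from the Usage section), around 200TB of raw storage suggests roughly
2GB of namenode heap, which could be applied with::

juju set hadoop-master heap=2048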
mapreduce_task_io_sort_factor (int, default: 10):
The number of streams to merge at once while sorting files. This
determines the number of open file handles.
dfs_block_size (int, default: 134217728):
The default block size for new files (the Hadoop 1.x default is 64MB; this charm
defaults to 128MB). Increase this in larger deployments for better large data set
performance.
yarn_nodemanager_aux-services (string, default: mapreduce_shuffle):
The auxiliary shuffle service that must be configured for MapReduce applications.
mapreduce_task_io_sort_mb (int, default: 100):
The memory limit, in MB, to use while sorting data; a higher limit improves efficiency.
dfs_namenode_handler_count (int, default: 10):
The number of server threads for the namenode. Increase this in larger
deployments to ensure the namenode can cope with the number of datanodes
that it has to deal with.
io_file_buffer_size (int, default: 4096):
The size of the buffer for use in sequence files. The size of this buffer should
probably be a multiple of the hardware page size (4096 on Intel x86), and it
determines how much data is buffered during read and write operations.
dfs_namenode_heartbeat_recheck_interval (int, default: 300000):
Determines the datanode heartbeat recheck interval, in milliseconds.
It is used to calculate the final timeout after which the namenode marks a
datanode as dead:
.
  timeout = 2 * dfs.namenode.heartbeat.recheck-interval
            + 10 * 1000 * dfs.heartbeat.interval
.
With the defaults (recheck-interval = 5*60*1000 ms and heartbeat interval = 3 s)
this gives 2*300000 + 10*1000*3 = 630000 ms, i.e. 10 minutes 30 seconds.
dfs_heartbeat_interval (int, default: 3):
Determines the datanode heartbeat interval, in seconds.