Apache Hadoop


Description

Hadoop is a software platform that lets one easily write and
run applications that process vast amounts of data.

Overview

What is Apache Hadoop?
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using a simple programming model.

It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage. Rather than rely on hardware
to deliver high availability, the library itself is designed to detect
and handle failures at the application layer, delivering a
highly available service on top of a cluster of computers, each of
which may be prone to failure.

Apache Hadoop 2.4.1 includes significant improvements over the previous stable
release (hadoop-1.x).

Here is a short overview of the improvements to both HDFS and MapReduce.

  • HDFS Federation: In order to scale the name service horizontally, federation uses multiple independent
    Namenodes/Namespaces. The Namenodes are federated, that is, the
    Namenodes are independent and don't require coordination with each
    other. The datanodes are used as common storage for blocks by all
    the Namenodes. Each datanode registers with all the Namenodes in the
    cluster. Datanodes send periodic heartbeats and block reports and
    handle commands from the Namenodes (see the command sketch after this overview).

    More details are available in the HDFS Federation document:
    http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/Federation.html

  • MapReduce NextGen aka YARN aka MRv2
    The new architecture, introduced in hadoop-0.23, divides the two major functions of the
    JobTracker, resource management and job life-cycle management, into separate components.
    The new ResourceManager manages the global assignment of compute resources to
    applications, and the per-application ApplicationMaster manages the application's
    scheduling and coordination.
    An application is either a single job in the sense of classic MapReduce jobs or a DAG of
    such jobs.

The ResourceManager and per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric.

The per-application ApplicationMaster is, in effect, a framework-specific library and is
tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

More details are available in the YARN document:
http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/YARN.html
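
As a rough illustration of both points (a sketch only, assuming shell access to a node that has the Hadoop 2.4.1 client tools on its path; none of these commands are provided by the charm itself), you can list the Namenodes a federated configuration knows about and watch a classic MapReduce job run as a YARN application::

# list every namenode declared in the cluster configuration
hdfs getconf -namenodes
# submit the bundled pi-estimation example as a YARN application
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10
# inspect running applications and the nodemanagers serving them
yarn application -list
yarn node -list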

Usage

This charm supports the following Hadoop roles:

  • HDFS: namenode and datanode
  • YARN: ResourceManager, JobHistory, and NodeManager

This supports deployments of Hadoop in a number of configurations.

Simple Usage: Combined HDFS and YARN

In this configuration, the YARN ResourceManager is deployed on the same
service unit as the HDFS namenode, and the HDFS datanodes also run the YARN NodeManager::

juju deploy hadoop yarn-hdfs-master
juju deploy hadoop compute-nodes
juju add-relation yarn-hdfs-master:namenode compute-nodes:datanode
juju add-relation yarn-hdfs-master:resourcemanager compute-nodes:nodemanager
juju add-unit -n 3 compute-nodes
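
Once the units are up you can sanity-check the cluster (a sketch; the unit name yarn-hdfs-master/0 follows the deployment above, and depending on how the charm sets up users the report command may need to run as the HDFS superuser)::

# watch the units come up
juju status
# confirm the datanodes have registered with the namenode
juju ssh yarn-hdfs-master/0 'hdfs dfsadmin -report'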

Scale Out Usage: Separate HDFS and YARN

In this configuration the HDFS and YARN deployments operate on
different service units as separate services::

juju deploy hadoop hdfs-master-server
juju deploy hadoop compute-nodes
juju deploy hadoop yarn-master-server
juju deploy hadoop hadoop-client
juju add-unit -n 3 compute-nodes
juju add-relation hdfs-master-server:namenode compute-nodes:datanode
juju add-relation yarn-master-server:mapred-namenode hdfs-master-server:namenode
juju add-relation compute-nodes:mapred-namenode hdfs-master-server:namenode    
juju add-relation yarn-master-server:resourcemanager compute-nodes:nodemanager
juju add-relation hadoop-client:mapred-namenode hdfs-master-server:namenode 
juju add-relation yarn-master-server:resourcemanager hadoop-client:nodemanager

In the long term, Juju should support improved placement of services to
better support this type of deployment. This would allow MapReduce services
to be deployed onto machines with more processing power and HDFS services
onto machines with larger storage.
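
Until then, Juju constraints give a rough approximation of this placement (a sketch; the constraint values below are illustrative, not recommendations)::

# request compute-heavy machines for the YARN/compute nodes
juju deploy --constraints "cpu-cores=8 mem=16G" hadoop compute-nodes
# request storage-heavy machines for the HDFS master
juju deploy --constraints "root-disk=500G" hadoop hdfs-master-server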

To deploy a Hadoop service alongside an ElasticSearch service::

# deploy ElasticSearch locally:
**juju deploy elasticsearch elasticsearch**
# the elasticsearch-hadoop.jar file will be added to the LIBJARS path
# it is recommended to use the hadoop -libjars option to include the elasticsearch-hadoop jar file
**juju add-unit elasticsearch**
# deploy the Hadoop service using any of the scenarios mentioned above
# associate Hadoop with ElasticSearch
**juju add-relation {hadoop master}:elasticsearch elasticsearch:client**
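
A MapReduce job can then pick the connector up through the generic -libjars option (a sketch; my-job.jar, com.example.MyJob and the jar path are placeholders, and the job's main class must use GenericOptionsParser/Tool for -libjars to take effect)::

# ship the elasticsearch-hadoop connector alongside the job's tasks
hadoop jar my-job.jar com.example.MyJob -libjars /path/to/elasticsearch-hadoop.jar <input> <output>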

Contact Information

Amir Sanjar <amir.sanjar@canonical.com>

Configuration

yarn_nodemanager_aux-services
(string) Shuffle service that needs to be set for MapReduce applications.
Default: mapreduce_shuffle
dfs_datanode_max_xcievers
(int) The number of files that a datanode will serve at any one time. A Hadoop HDFS datanode has an upper bound on the number of files that it will serve at any one time. This defaults to 256 (which is low) in hadoop 1.x; this charm increases it to 4096.
Default: 4096
tasktracker_http_threads
(int) The number of worker threads for the HTTP server. This is used for map output fetching.
Default: 40
dfs_block_size
(int) The default block size for new files (the charm default is 128MB). Increase this in larger deployments for better large data set performance.
Default: 134217728
yarn_local_log_dir
(string) Space-separated list of directories where YARN will store container log data.
Default: /grid/hadoop/yarn/local
yarn_nodemanager_aux-services_mapreduce_shuffle_class
(string) Shuffle service class that needs to be set for MapReduce applications.
Default: org.apache.hadoop.mapred.ShuffleHandler
mapreduce_reduce_shuffle_parallelcopies
(int) The default number of parallel transfers run by reduce during the copy (shuffle) phase.
Default: 5
mapreduce_task_io_sort_mb
(int) Higher memory limit while sorting data, for efficiency.
Default: 100
dfs_namenode_handler_count
(int) The number of server threads for the namenode. Increase this in larger deployments to ensure the namenode can cope with the number of datanodes it has to deal with.
Default: 10
mapreduce_task_io_sort_factor
(int) More streams merged at once while sorting files. This determines the number of open file handles.
Default: 10
io_file_buffer_size
(int) The size of the buffer for use in sequence files. The size of this buffer should probably be a multiple of the hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.
Default: 4096
hadoop_dir_base
(string) The directory under which all other hadoop data is stored. Use this to take advantage of extra storage that might be available. You can change this in a running deployment, but all existing data in HDFS will be inaccessible; you can of course switch it back if you do this by mistake.
Default: /usr/local/hadoop/data
mapred_job_tracker_handler_count
(int) The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes.
Default: 10
hdfs_log_dir
(string) Directory for storing the HDFS logs.
Default: /var/log/hadoop/hdfs
mapred_child_java_opts
(string) Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to one gigabyte, pass a value of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc. The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.
Default: -Xmx200m
mapreduce_framework_name
(string) Execution framework, set to Hadoop YARN. ** DO NOT CHANGE **
Default: yarn
dfs_namenode_heartbeat_recheck_interval
(int) Determines the datanode heartbeat recheck interval in milliseconds. It is used to calculate the final timeout value for the namenode. The calculation is as follows: 10 minutes 30 seconds = 2 x (dfs.namenode.heartbeat.recheck-interval = 5*60*1000) + 10 * 1000 * (dfs.heartbeat.interval = 3)
Default: 300000
yarn_local_dir
(string) Space-separated list of directories where YARN will store temporary data.
Default: /grid/hadoop/yarn/local
dfs_replication
(int) Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at creation time.
Default: 3
dfs_heartbeat_interval
(int) Determines the datanode heartbeat interval in seconds.
Default: 3
yarn_log_dir
(string) Directory for storing the YARN logs.
Default: /var/log/hadoop/yarn
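
All of the options above can be supplied at deploy time or changed on a running service (a sketch, using juju 1.x syntax; the values are illustrative only, and which service an option takes effect on depends on the role that service plays)::

# change options on a running service
juju set compute-nodes dfs_replication=2 dfs_block_size=268435456
# or supply them at deploy time from a YAML file
juju deploy --config my-hadoop.yaml hadoop compute-nodes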