Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. . Here's what makes Hadoop especially useful: . * Scalable: Hadoop can reliably store and process petabytes. * Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes. * Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid. * Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures. . Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-avaiability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-availabile service on top of a cluster of computers, each of which may be prone to failures.
Hadoop consists of the following two core components:
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
This charm supports the following Hadoop roles:
- HDFS: namenode, secondarynamenode and datanode
- MapReduce: jobtracker, tasktracker
This supports deployments of Hadoop in a number of configurations.
Combined HDFS and MapReduce
In this configuration, the MapReduce jobtracker is deployed on the same service units as HDFS namenode and the HDFS datanodes also run MapReduce tasktrackers::
juju deploy hadoop hadoop-master juju deploy hadoop hadoop-slavecluster juju add-unit -n 2 hadoop-slavecluster juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode juju add-relation hadoop-master:jobtracker hadoop-slavecluster:tasktracker
Separate HDFS and MapReduce
In this configuration the HDFS and MapReduce deployments operate on different service units as separate services::
juju deploy hadoop hdfs-namenode juju deploy hadoop hdfs-datacluster juju add-unit -n 2 hdfs-datacluster juju add-relation hdfs-namenode:namenode hdfs-datacluster:datanode juju deploy hadoop mapred-jobtracker juju deploy hadoop mapred-taskcluster juju add-unit -n 2 mapred-taskcluster juju add-relation mapred-jobtracker:mapred-namenode hdfs-namenode:namenode juju add-relation mapred-taskcluster:mapred-namenode hdfs-namenode:namenode juju add-relation mapred-jobtracker:jobtracker mapred-taskcluster:tasktracker
In the long term juju should support improved placement of services to better support this type of deployment. This would allow mapreduce services to be deployed onto machines with more processing power and hdfs services to be deployed onto machines with larger storage.
HDFS with HBase
This charm also supports deployment of HBase; HBase requires that append mode is enabled in DFS - this can be set by providing a config.yaml file::
hdfs-namenode: hbase: true hdfs-datacluster: hbase: true
Its really important to ensure that both the master and the slave services have the same configuration in this deployment scenario.
The charm can then be use to deploy services with this configuration::
juju deploy --config config.yaml hadoop hdfs-namenode juju deploy --config config.yaml hadoop hdfs-datacluster juju add-unit -n 2 hdfs-datacluster juju add-relation hdfs-namenode:namenode hdfs-datacluster:datanode
You can then associate a hdfs service deployment with a hbase service deployment::
juju add-relation hdfs-namenode:namenode hbase-master:namenode juju add-relation hdfs-namenode:namenode hbase-regioncluster:namenode juju add-relation hdfs-namenode:namenode hbase-datacluster:namenode
See the hbase charm for more details on deploying HBase.
Words of Caution
Note that removing the relation between namenode and datanode is destructive! The role of the service is determined at the point that the relation is added (it must be qualified) and CANNOT be changed later!
A single hdfs-master can support multiple slave service deployments::
juju deploy hadoop hdfs-datacluster-02 juju add-unit -n 2 hdfs-datacluster-02 juju add-relation hdfs-namenode:namenode hdfs-datacluster-02:datanode
This could potentially be used to perform charm upgrades on datanodes in sets::
juju upgrade-charm hdfs-datacluster
Go and make some tea whilst monitoring juju debug-log)
juju upgrade-charm hdfs-datacluster-02
Could be helpful to avoid outages (to be proven).
- (int) The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.
- (int) The number of worker threads that for the http server. This is used for map output fetching.
- (string) Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: . -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc . The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.
- (string) The directory under which all other hadoop data is stored. Use this to take advantage of extra storage that might be avaliable. . You can change this in a running deployment but all existing data in HDFS will be inaccessible; you can of course switch it back if you do this by mistake.
- (int) The number of streams to merge at once while sorting files. This determines the number of open file handles.
- (int) The number of files that an datanode will serve at any one time. . An Hadoop HDFS datanode has an upper bound on the number of files that it will serve at any one time. This defaults to 256 (which is low) in hadoop 1.x - however this charm increases that to 4096.
- (int) The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
- (boolean) To install Apache Pig on all service units alongside Hadoop set this configuration to 'True'. . Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
- (int) The default number of parallel transfers run by reduce during the copy(shuffle) phase.
- (string) Location and packages to install hadoop: . * dev: Install using the hadoop packages from ppa:hadoop-ubuntu/dev. * testing: Install using the hadoop packages from ppa:hadoop-ubuntu/testing. * stable: Install using the hadoop packages from ppa:hadoop-ubuntu/stable. . The packages provided in the hadoop-ubuntu team PPA's are based directly on upstream hadoop releases but are not fully built from source.
- (int) The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes.
- (int) The maximum heap size in MB to allocate for daemons processes within the service units managed by this charm. . The recommended configurations vary based on role and the amount of raw disk storage available in the hadoop cluster: . * NameNode: 1GB of heap for every 100TB of raw data stored. * SecondaryNameNode: Must be paired with the NameNode. * JobTracker: 2GB. * DataNode: 1GB. * TaskTracker: 1GB. . The above recommendations are taken from HBase: The Definitive Guide by Lars George. . Obviously you need to ensure that the servers supporting each service unit have sufficient memory to accomodate this setting - it should be no more than 75% of the total memory in the system excluding swap. . If you are also mixing MapReduce and DFS roles on the same units you need to take this into account as well (see README for more details).
- (int) The number of server threads for the namenode. Increase this in larger deployments to ensure the namenode can cope with the number of datanodes that it has to deal with.
- (boolean) Hadoop includes a RESTful API over HTTP to HDFS. Setting this flag to True enables this part of the HDFS service.
- (boolean) Setting this configuration parameter to 'True' configures HDFS for use with HBase including turning on 'append' mode which is not desirable in all deployment scenarios.
- (int) The default block size for new files (default to 64MB). Increase this in larger deployments for better large data set performance.