apache oozie #1

  • By asanjar
  • Latest version (#1)
  • trusty

Description

Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by
time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack, supporting several
types of Hadoop jobs out of the box (such as Java map-reduce, Streaming
map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs.
Oozie is a scalable, reliable and extensible system.
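
To make the "DAG of actions" idea concrete, here is a minimal sketch of a two-action workflow definition, written to a scratch directory by a shell heredoc. The application name, action names and HDFS paths are hypothetical, and the 0.4 workflow schema is an assumption about the installed Oozie version. Each `to=` attribute is one edge of the graph.

```shell
#!/bin/sh
# Write a minimal two-action Oozie workflow definition to a scratch directory.
# The application name, action names and paths below are hypothetical.
workdir=$(mktemp -d)

# The quoted 'EOF' keeps ${nameNode} and the EL expression literal, so Oozie
# (not the shell) resolves them when the workflow is submitted.
cat > "$workdir/workflow.xml" <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.4" name="example-wf">
    <start to="make-dir"/>
    <action name="make-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/example-wf/out"/>
        </fs>
        <ok to="mark-done"/>
        <error to="fail"/>
    </action>
    <action name="mark-done">
        <fs>
            <touchz path="${nameNode}/tmp/example-wf/out/_DONE"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

# List the transitions (the edges of the graph), one per line.
grep -o 'to="[^"]*"' "$workdir/workflow.xml"
```

The script only generates and inspects the file; submitting it to an Oozie server would additionally need a job properties file that defines `nameNode`.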


Overview

Data warehouse infrastructure built on top of Hadoop.

Hive 0.11.3 is a data warehouse infrastructure built on top of Hadoop that
provides tools to enable easy data summarization, ad hoc querying and
analysis of large datasets stored in Hadoop files. It provides a
mechanism to put structure on this data and a simple query language
called Hive QL, which is based on SQL and enables users familiar with
SQL to query this data. At the same time, this language allows
traditional map/reduce programmers to plug in their custom mappers and
reducers to do more sophisticated analysis which may not be supported
by the built-in capabilities of the language.

Hive provides:

  • HiveQL - An SQL dialect for querying data in an RDBMS fashion
  • UDF/UDAF/UDTF (User Defined [Aggregate/Table] Functions) - Allows users to
    create custom Map/Reduce based functions for regular use
  • Ability to do joins (inner/outer/semi) between tables
  • Support (limited) for sub-queries
  • Support for table 'Views'
  • Ability to partition data into Hive partitions or buckets to enable faster
    querying
  • Hive Web Interface - A web interface to Hive
  • Hive Server2 - Supports multi-user querying using Thrift, JDBC and ODBC clients
  • Hive Metastore - Ability to run a separate Metadata storage process
  • Hive CLI - A Hive command line that supports HiveQL

See http://hive.apache.org for more information.

This charm provides the Hive Server and Metastore roles which form part of an
overall Hive deployment.
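
As an illustration of the partitioning, bucketing, join and view features listed above, the following sketch writes a small HiveQL script to a scratch file. All table, column and view names are hypothetical, and the script is not taken from the charm.

```shell
#!/bin/sh
# Write a small HiveQL script to a scratch file.
# Table, column and view names are hypothetical examples.
hql_file=$(mktemp)

cat > "$hql_file" <<'EOF'
-- Partition by date so queries that filter on dt read only matching partitions;
-- bucket by user_id to spread each partition over a fixed number of files.
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS;

CREATE TABLE users (user_id BIGINT, country STRING);

-- A view over an inner join between the two tables.
CREATE VIEW views_by_country AS
SELECT u.country, v.dt, COUNT(*) AS view_count
FROM page_views v JOIN users u ON (v.user_id = u.user_id)
GROUP BY u.country, v.dt;

SELECT country, view_count FROM views_by_country WHERE dt = '2014-01-01';
EOF

echo "HiveQL script written to $hql_file"
```

On a host where the Hive CLI is installed, the generated file could then be run with `hive -f "$hql_file"`.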

Usage

A Hive deployment consists of a Hive service, an RDBMS (only MySQL is currently
supported), an optional Metastore service and a Hadoop cluster.

To deploy a simple four-node Hadoop cluster (see the Hadoop charm README for
further information)::

juju deploy hadoop hadoop-master
juju deploy hadoop hadoop-slavecluster
juju add-unit -n 2 hadoop-slavecluster
juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
juju add-relation hadoop-master:resourcemanager hadoop-slavecluster:nodemanager

A Hive server stores metadata in MySQL::

juju deploy mysql
# hive requires ROW binlog
juju set mysql binlog-format=ROW

To deploy a Hive service without a Metastore service::

# deploy Hive instance (hive-server2)
juju deploy hive2 hive-server 
# associate Hive with MySQL
juju add-relation hive-server:db mysql:db

# associate Hive with HDFS Namenode
juju add-relation hive-server:namenode hadoop-master:namenode
# associate Hive with resourcemanager
juju add-relation hive-server:resourcemanager hadoop-master:resourcemanager

To deploy a Hive service with a Metastore service::

# deploy Metastore instance
juju deploy hive2 hive-metastore
# associate Metastore with MySQL
juju add-relation hive-metastore:db mysql:db

# associate Metastore with Namenode
juju add-relation hive-metastore:namenode hadoop-master:namenode

# deploy Hive instance
juju deploy hive2 hive-server
# associate Hive with Metastore
juju add-relation hive-server:server hive-metastore:metastore
# associate Hive with Namenode
juju add-relation hive-server:namenode hadoop-master:namenode
# associate Hive with resourcemanager
juju add-relation hive-server:resourcemanager hadoop-master:resourcemanager
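
The Metastore scenario above involves seven commands whose order matters, so it can help to wrap them in a script that can be previewed before anything is deployed. The following is a sketch only: the `DRY_RUN` variable and the function names are hypothetical, and the juju commands are exactly those listed above.

```shell
#!/bin/sh
# Dry-run helper for the Metastore scenario.
# DRY_RUN and the function names are hypothetical; with DRY_RUN=1 (the default)
# commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

deploy_hive_with_metastore() {
    # Metastore first, so the Hive server has something to relate to.
    run juju deploy hive2 hive-metastore
    run juju add-relation hive-metastore:db mysql:db
    run juju add-relation hive-metastore:namenode hadoop-master:namenode
    # Then the Hive server and its relations.
    run juju deploy hive2 hive-server
    run juju add-relation hive-server:server hive-metastore:metastore
    run juju add-relation hive-server:namenode hadoop-master:namenode
    run juju add-relation hive-server:resourcemanager hadoop-master:resourcemanager
}

deploy_hive_with_metastore
```

Running the script with `DRY_RUN=0` would execute the commands for real, which assumes the Hadoop cluster and MySQL service from the earlier steps are already deployed.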

Further Hive service units may be deployed::

juju add-unit hive-server

To deploy a Hive service with an Elasticsearch service::

# deploy Elasticsearch locally
juju deploy local:elasticsearch elk
juju add-unit -n elk
# deploy the Hive service using any of the scenarios mentioned above
# associate Hive with elasticsearch
juju add-relation hive-server:elk elk:client

This currently only works when using a Metastore service.

To deploy a Hive service with an Elasticsearch service::

# deploy Elasticsearch locally
juju deploy local:elasticsearch elk
juju add-unit -n elk
# deploy the Hive service using any of the scenarios mentioned above
# associate Hive with elasticsearch
juju add-relation hive-server:elasticsearch elk:client

Configuration

resources_mirror
(string) URL from which to fetch resources (e.g., Hadoop binaries) instead of Launchpad.
heap
(int) The maximum heap size (in MB) used by the Hadoop client JVM. If you experience out of memory (OOM) errors when running jobs, consider increasing this value.
Default: 1024
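
As a sketch of how the heap option might be raised, the helper below checks that the value is a whole number of megabytes and then prints the corresponding `juju set` command, using the same `juju set <service> <option>=<value>` form as the MySQL example in the Usage section. The service name `hive-server` and the function name are assumptions for illustration; substitute the name the charm was deployed under.

```shell
#!/bin/sh
# Print a "juju set" command for the heap option after validating the value.
# The service name and the function name are hypothetical.
heap_command() {
    service=$1
    heap_mb=$2
    case "$heap_mb" in
        ''|*[!0-9]*)
            # Reject empty values and anything that is not purely digits.
            echo "heap must be a whole number of MB, got: '$heap_mb'" >&2
            return 1
            ;;
    esac
    echo "juju set $service heap=$heap_mb"
}

heap_command hive-server 2048
```

The function only prints the command, so the output can be reviewed and then pasted into a shell that has access to the Juju environment.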