
Hortonworks HIVE Overview

Data warehouse infrastructure built on top of Hortonworks Hadoop.

Hortonworks Apache Hive 0.12.x is a data warehouse infrastructure built
on top of Hortonworks Hadoop 2.4.1 that provides tools for easy data
summarization, ad hoc querying, and analysis of large datasets stored in
Hadoop files. It provides a mechanism to put structure on this data, and
a simple query language called HiveQL, based on SQL, that enables users
familiar with SQL to query the data. At the same time, the language
allows traditional MapReduce programmers to plug in their own custom
mappers and reducers to do more sophisticated analysis that may not be
supported by its built-in capabilities.
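
For illustration, a HiveQL session might mix SQL-style queries with a
user-supplied script (the table name and script here are hypothetical,
not part of this charm)::

-- plain SQL-style aggregation over a table stored in HDFS
SELECT page, COUNT(*) AS hits FROM access_log GROUP BY page;

-- streaming rows through a custom user script (mapper/reducer style)
FROM access_log
SELECT TRANSFORM (page)
USING 'python my_script.py'
AS (page_clean);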

Hive provides:

  • HiveQL - An SQL dialect for querying data in an RDBMS fashion
  • UDF/UDAF/UDTF (User Defined [Aggregate/Table] Functions) - Allow users to
    create custom MapReduce-based functions for regular use
  • Ability to do joins (inner/outer/semi) between tables
  • Support (limited) for sub-queries
  • Support for table 'Views'
  • Ability to partition data into Hive partitions or buckets to enable faster
    querying (see the sketch after this list)
  • Hive Web Interface - A web interface to Hive
  • HiveServer2 - Supports multi-user querying using Thrift, JDBC and ODBC clients
  • Hive Metastore - Ability to run a separate Metadata storage process
  • Hive CLI - A Hive command line that supports HiveQL
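
As a quick sketch of the partitioning feature above (table and column names
are illustrative only, not part of this charm)::

-- partition by date so queries on dt only scan matching directories;
-- bucket by user_id to speed up sampling and joins
CREATE TABLE access_log (user_id INT, page STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS;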

See http://hive.apache.org for more information.

This charm provides the Hive Server and Metastore roles which form part of an
overall Hive deployment.

Hortonworks HIVE Usage

A Hive deployment consists of a Hive service, an RDBMS (only MySQL is currently
supported), an optional Metastore service, and a Hadoop cluster.

To deploy a simple four-node Hadoop cluster (see the Hadoop charm README for
further information)::

juju deploy hdp-hadoop yarn-hdfs-master
juju deploy hdp-hadoop compute-node
juju add-unit -n 2 compute-node
juju add-relation yarn-hdfs-master:namenode compute-node:datanode
juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager
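
While the units come up, progress can be watched with juju status; all units
should eventually reach a started state::

juju status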

A Hive server stores metadata in MySQL::

juju deploy mysql
# hive requires ROW binlog
juju set mysql binlog-format=ROW
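
To confirm the setting took effect, the service configuration can be
inspected (output elided here)::

juju get mysql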

To deploy a Hive service without a Metastore service::

# deploy Hive instance (hive-server2)
juju deploy hdp-hive hdphive 
# associate Hive with MySQL
juju add-relation hdphive:db mysql:db

# associate Hive with HDFS Namenode
juju add-relation hdphive:namenode yarn-hdfs-master:namenode
# associate Hive with resourcemanager
juju add-relation hdphive:resourcemanager yarn-hdfs-master:resourcemanager
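
Once the relations are established, a quick sanity check is to confirm that
juju status shows hdphive related to mysql and yarn-hdfs-master::

juju status hdphive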

Smoke Test

Once you have a cluster running, just run:
1) juju ssh yarn-hdfs-master/0 <<= ssh to the Hadoop master
2) Smoke test HDFS admin functionality - as the HDFS user, create
   /user/$CLIENT_USER in the Hadoop file system. The steps below verify and
   demonstrate HDFS functionality (an optional listing check follows step e):
a) sudo su $HDFS_USER
b) hdfs dfs -mkdir -p /user/ubuntu
c) hdfs dfs -chown ubuntu:ubuntu /user/ubuntu
d) hdfs dfs -chmod -R 755 /user/ubuntu
e) exit
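
As the optional check mentioned above (not part of the original steps), the
new directory's ownership and permissions can be listed::

hdfs dfs -ls /user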

3) Smoke test YARN and MapReduce - run the smoke test as the $CLIENT_USER,
   using TeraGen and TeraSort to generate and sort sample data:
   a) hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar 
      teragen 10000 /user/ubuntu/teragenout 
   b) hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar 
      terasort /user/ubuntu/teragenout /user/ubuntu/terasortout
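   c) Optionally, validate the sorted output with TeraValidate from the same
      examples jar (an extra check, not in the original steps):
      hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar 
      teravalidate /user/ubuntu/terasortout /user/ubuntu/teravalidateout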

4) Smoke test HDFS functionality from the ubuntu user space - delete the
   MapReduce output from HDFS:
   hdfs dfs -rm -r /user/ubuntu/teragenout

HIVE+HDFS Usage:
1) juju ssh hdphive/0 <<= ssh to the Hive server
2) sudo su $HIVE_USER
3) hive
4) from the Hive console (an optional extra check follows these steps):
   show databases;
   create table test(col1 int, col2 string);
   show tables;
   exit;
5) exit from $HIVE_USER session
6) sudo su $HDFS_USER
7) hdfs dfsadmin -report <<== verify the connection to the remote HDFS cluster
8) hdfs dfs -ls /apps/hive/warehouse <<== verify that the "test" table
   directory has been created on the remote HDFS cluster
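
As the optional extra check referenced in step 4, a query that forces a
MapReduce job can be run from the Hive console before exiting; on the empty
table it should return 0::

select count(*) from test;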

Scale Out Usage

In order to increase the number of slaves, you must add units. To add one unit:
juju add-unit compute-node
Or you can add multiple units at once:
juju add-unit -n4 compute-node
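
Units can be removed in the same way if the cluster is over-provisioned (the
unit name here is illustrative):
juju remove-unit compute-node/3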

Contact Information

Amir Sanjar <amir.sanjar@canonical.com>

Upstream Project Name

Apache Hive - http://hive.apache.org

Configuration

heap (int)
    The maximum heap size in MB to allocate for daemon processes within the
    service units managed by this charm.
    Default: 1024
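
For example, to raise the heap for the hdphive service deployed above (the
value is illustrative)::

juju set hdphive heap=2048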