hive

By charmers

  • 908 deploys

Description

Data warehouse infrastructure built on top of Hadoop . Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.

Readme

Overview

Data warehouse infrastructure built on top of Hadoop.

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.

Hive provides:

  • HiveQL - An SQL dialect language for querying data in a RDBMS fashion
  • UDF/UDAF/UDTF (User Defined [Aggregate/Table] Functions) - Allows user to create custom Map/Reduce based functions for regular use
  • Ability to do joins (inner/outer/semi) between tables
  • Support (limited) for sub-queries
  • Support for table 'Views'
  • Ability to partition data into Hive partitions or buckets to enable faster querying
  • Hive Web Interface - A web interface to Hive
  • Hive Server - Supports querying using Thrift, JDBC and ODBC clients
  • Hive Metastore - Ability to run a separate Metadata storage process -* Hive cli - A Hive commandline that supports HiveQL

See [http://hive.apache.org]http://hive.apache.org) for more information.

This charm provides the Hive Server and Metastore roles which form part of an overall Hive deployment.

Usage

A Hive deployment consists of a Hive service, a RDBMS (only MySQL is currently supported), an optional Metastore service and a Hadoop cluster.

To deploy a simple four node Hadoop cluster (see Hadoop charm README for further information)::

juju deploy hadoop hadoop-master
juju deploy hadoop hadoop-slavecluster
juju add-unit -n 2 hadoop-slavecluster
juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
juju add-relation hadoop-master:jobtracker hadoop-slavecluster:tasktracker

A Hive server stores metadata in MySQL::

juju deploy mysql

Please read Known Issues 1. below before proceeding further

To deploy a Hive service without a Metastore service::

# deploy Hive instance
juju deploy hive hive-server
# associate Hive with MySQL
juju add-relation hive-server:db mysql:db

Please read Known Issues 2. below before proceeding further

# associate Hive with HDFS Namenode
juju add-relation hive-server:namenode hadoop-master:namenode
# associate Hive with Jobtracker
juju add-relation hive-server:jobtracker hadoop-master:jobtracker

To deploy a Hive service with a Metastore service::

# deploy Metastore instance
juju deploy hive hive-metastore
# associate Metastore with MySQL
juju add-relation hive-metastore:db mysql:db

Please read Known Issues 2. below before proceeding further

# associate Metastore with Namenode
juju add-relation hive-metastore:namenode hadoop-master:namenode

# deploy Hive instance
juju deploy hive hive-server
# associate Hive with Metastore
juju add-relation hive-server:server hive-metastore:metastore
# associate Hive with Namenode
juju add-relation hive-server:namenode hadoop-master:namenode
# associate Hive with Jobtracker
juju add-relation hive-server:jobtracker hadoop-master:jobtracker

Further Hive service units may be deployed::

juju add-unit hive-server

This currently only works when using a Metastore service.

Known Issues

  1. The current MySQL Charm does not alter the binlog format to 'ROW' (nor can the binlog be disabled) so attempting to use Hive will result in a nasty error message of::

    'Cannot execute statement: impossible to write to binary log since BINLOG_FORMAT = STATEMENT and at least one table uses a storage engine limited to row-based logging. InnoDB is limited to row-logging when transaction isolation level is READ COMMITTED or READ UNCOMMITTED.'

The current solution is to ssh into a MySQL charm machine and manually set the binlog format::

sudo bash -c 'echo "binlog_format = ROW" >> /etc/mysql/conf.d/binlog.cnf'
sudo service mysql restart

This should be done after deploying the MySQL charm.

  1. The current MySQL Charm creates a database using 'utf8' encoding, which will give you the following error with Hive::

    'MetaException(message:Got exception: javax.jdo.JDODataStoreException An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes'

The current solution is drop the database and recreate it (to use 'latin1' encoding)::

mysql -u root -p`cat /var/lib/juju/mysql.passwd` -e 'drop database `<name>`; \
create database `<name>`;'

where '' is the service name, either 'hive-server' or 'hive-metastore'.

This should be done after you have associated Hive with the database.

Configuration

source
(string) Location and packages to install hive: . * dev: Install using the hive packages from ppa:hadoop-ubuntu/dev. * testing: Install using the hive packages from ppa:hadoop-ubuntu/testing. * stable: Install using the hive packages from ppa:hadoop-ubuntu/stable. . The packages provided in the hadoop-ubuntu team PPA's are based directly on upstream hive releases but are not fully built from source.
stable
heap
(int) The maximum heap size in MB to allocate for daemons processes within the service units managed by this charm.
1024