
Apache Drill and Apache Bigtop

Analyse and explore data.

Apache Drill allows you to query a range of less traditional data sources. These do not have to be SQL databases; they might be CSV files, JSON files, data stored in Hadoop, or a combination of all three and more. Drill lets users query multiple data sources as a single entity, so you can combine customer data from your CRM with sales export data stored on a shared file server in a single view, giving users better insight into their data than ever before.
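As a sketch of what that looks like, a single Drill query can join data held in two different storage plugins. The crm and files plugin names, file paths, and column names below are hypothetical placeholders:

select c.name, sum(s.amount) as total
from `crm`.`root`.`/exports/customers.json` c
join `files`.`root`.`/sales/orders.json` s on c.id = s.customer_id
group by c.name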

This bundle builds on top of the Hadoop processing bundle with additional Drill nodes and HDFS connectivity. This allows you to run SQL queries against files inside HDFS and to scale those queries as your data grows. Apache Drill executes each query as close to the data as possible, improving performance and increasing the amount of data that can be crunched.
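Scaling out is a single Juju command per application. The apache-drill name below matches the expose command later on this page; slave is an assumed name for the Hadoop slave application, so check juju status for the actual names in your model:

juju add-unit apache-drill
juju add-unit slave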

Usage

SQL Analysis For Big Data.

This bundle is a full Hadoop deployment with Apache Drill, designed to allow easy deployment of a scalable SQL-over-Hadoop setup. Deploying this bundle will create the following units:

  • 3 Apache Drill
  • 3 Apache Zookeeper
  • 1 Hadoop Namenode
  • 3 Hadoop Slaves
  • 1 Hadoop Resource Manager
  • 1 Ganglia Server
  • 1 Rsyslog Server

Deployment

There are two easy ways to deploy this bundle.

GUI

Click the Add to model button at the top of this page, then the Deploy changes button, and follow the on-screen instructions.

Command Line

Deploy this bundle using Juju:

juju deploy ~spiculecharms/drill-hadoop
juju expose apache-drill
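Once the model settles, you can watch progress and find the Drill unit's public address with juju status. Drill's web console listens on port 8047 by default:

juju status apache-drill
# then browse to http://<apache-drill-public-address>:8047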

Interacting with the bundle

Getting data into Hadoop

The first task is of course getting some queryable data into Hadoop. To access the HDFS filesystem, SSH into the namenode and use the hdfs command-line tool. For example:

hdfs dfs -put parquet/userdata* /user/ubuntu/sample/
hdfs dfs -ls /user/ubuntu/sample
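If the sample files start on your workstation, Juju can copy them up to the namenode before the put. The namenode/0 unit name here is an assumption; confirm the real name with juju status:

juju scp parquet/userdata1.parquet namenode/0:
juju ssh namenode/0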

Querying the data

From the Drill web interface you should then be able to interrogate your data with something like this:

select * from `juju_hdfs_namenode`.`root`.`/sample/`

This will then run your SQL query over your data.

The juju_hdfs_namenode part is the storage plugin name, root is the workspace, and the last part is the directory your data is stored in.
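You can also target a single file and shape the result. The userdata1.parquet file name follows the earlier put command, but the column names here are illustrative, so substitute fields from your own data:

select first_name, country from `juju_hdfs_namenode`.`root`.`/sample/userdata1.parquet` limit 10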

Bundle configuration