Apache™ Pig allows you to write complex MapReduce transformations using a
simple scripting language. Pig Latin (the language) defines a set of
transformations on a data set such as aggregate, join and sort.
Pig translates the Pig Latin script into MapReduce so that it can be executed
within Hadoop®. Pig Latin can be extended with UDFs
(User Defined Functions), which the user can write in Java or a scripting
language and then call directly from a Pig Latin script.

Hortonworks Pig overview

Apache Pig, as packaged in Hortonworks HDP 2.1, is a platform for analyzing
large data sets. It consists of a high-level language for expressing data
analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large
data sets.
At the present time, Pig's infrastructure layer consists of a compiler that
produces sequences of Map-Reduce programs, for which large-scale parallel
implementations already exist (e.g., the Hadoop subproject). Pig's language
layer currently consists of a textual language called Pig Latin, which has the
following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple,
    "embarrassingly parallel" data analysis tasks. Complex tasks comprised of
    multiple interrelated data transformations are explicitly encoded as data
    flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the
    system to optimize their execution automatically, allowing the user to focus
    on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose
    processing.
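
As a rough illustration of the data-flow style described above, the sketch
below writes a small Pig Latin script that loads, aggregates and sorts the
local /etc/passwd file, then runs it in local mode if a pig client is present.
The path /tmp/shells.pig, the SHELLS_PIG variable and the relation names are
assumptions made for this example; they are not part of the charm.

```shell
#!/bin/sh
# Illustrative only: the path and relation names below are assumptions.
SHELLS_PIG="${SHELLS_PIG:-/tmp/shells.pig}"

# The quoted 'EOF' delimiter keeps the shell from expanding the script text.
cat > "$SHELLS_PIG" <<'EOF'
-- load: one tuple per line of /etc/passwd, fields split on ':'
users = load '/etc/passwd' using PigStorage(':')
        as (name:chararray, pw:chararray, uid:int, gid:int,
            gecos:chararray, home:chararray, login_shell:chararray);
-- aggregate: count users per login shell
by_shell = group users by login_shell;
counts = foreach by_shell generate group as login_shell, COUNT(users) as n;
-- sort: most common shell first
sorted = order counts by n desc;
dump sorted;
EOF

# Run the script only when a pig client is on the PATH.
if command -v pig >/dev/null 2>&1; then
    pig -x local "$SHELLS_PIG"
else
    echo "pig not found; wrote $SHELLS_PIG only"
fi
```

Each statement names an intermediate relation, so the script reads as a
sequence of transformations rather than as hand-written map and reduce steps.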

Pig has two execution modes, or exectypes:
- Local Mode - To run Pig in local mode, you need access to a single machine;
  all files are installed and run using your local host and file system. Specify
  local mode using the -x flag (pig -x local).
- Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop
  cluster and HDFS installation. Mapreduce mode is the default mode; you can,
  but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
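
The choice between the two exectypes comes down to the value passed to -x.
A minimal sketch, assuming a hypothetical helper named pig_cmd that only
prints the command line it would run (it does not invoke pig itself):

```shell
#!/bin/sh
# Hypothetical helper, not part of the charm.
# Usage: pig_cmd MODE SCRIPT   (MODE is "local" or "mapreduce")
pig_cmd() {
    mode="$1"
    script="$2"
    case "$mode" in
        local|mapreduce)
            printf 'pig -x %s %s\n' "$mode" "$script"
            ;;
        *)
            echo "unknown exectype: $mode" >&2
            return 1
            ;;
    esac
}

pig_cmd local /tmp/id.pig       # single machine, local file system
pig_cmd mapreduce /tmp/id.pig   # Hadoop cluster and HDFS (the default mode)
```

Because mapreduce is the default, the second command is equivalent to running
pig with no -x flag at all.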

This charm provides the Pig client with both of the execution modes described
above.

Hortonworks Pig usage

Step-by-step instructions on using the charm:

**Local Mode**
  juju deploy hdp-pig hdp-pig

**Mapreduce Mode - remote hadoop cluster**
  - Install Hadoop HDP 2.1 cluster
  juju deploy hdp-hadoop yarn-hdfs-master
  juju deploy hdp-hadoop compute-node
  juju add-unit -n 2 compute-node
  juju add-relation yarn-hdfs-master:namenode compute-node:datanode
  juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager

  - Install HDP Pig
  juju deploy hdp-pig hdp-pig
  juju add-relation hdp-pig:namenode yarn-hdfs-master:namenode
  juju add-relation hdp-pig:resourcemanager yarn-hdfs-master:resourcemanager

Smoke test local mode deployment:
1) pig -x local

Smoke test mapreduce deployment:
Verify connections to remote cluster:
1) juju ssh hdp-pig/0
2) sudo su $HDFS_USER
3) hadoop version <= verifies that the hadoop client is installed
4) hdfs dfsadmin -report <= verifies that the Pig client is connected to the
   remote HDFS server
5) yarn rmadmin -getGroups <= verifies that the Pig client is connected to the
   remote ResourceManager server
Run a Pig Script Test:
1) hdfs dfs -mkdir -p /user/hduser
2) hdfs dfs -copyFromLocal /etc/passwd /user/hduser/passwd
3) vim /tmp/id.pig
4) Add the following Pig script commands, then save and exit:
A = load '/user/hduser/passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into '/tmp/id.out';
5) pig -l /tmp/pig.log /tmp/id.pig
6) hadoop fs -cat /tmp/id.out/part-m-00000 <= checks the result on the
   hadoop cluster
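
Steps 3 to 5 can also be done without an editor. The sketch below is one
possible way to do it, assuming it is run on the hdp-pig unit as the same user
as the steps above. The PIG_SCRIPT variable is an assumption introduced here;
it defaults to the /tmp/id.pig path used in step 3.

```shell
#!/bin/sh
# Write the test script from the steps above without opening an editor.
# PIG_SCRIPT is a hypothetical override; the default matches step 3.
PIG_SCRIPT="${PIG_SCRIPT:-/tmp/id.pig}"

# The quoted 'EOF' delimiter stops the shell from expanding $0,
# so no backslash is needed in front of it.
cat > "$PIG_SCRIPT" <<'EOF'
A = load '/user/hduser/passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into '/tmp/id.out';
EOF

# Run it only when the pig client is on the PATH.
if command -v pig >/dev/null 2>&1; then
    pig -l /tmp/pig.log "$PIG_SCRIPT"
else
    echo "pig not found; wrote $PIG_SCRIPT only"
fi
```

With an unquoted heredoc delimiter the shell would replace $0 with its own
name, which is why the quoted form is used here.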

Developer Contact Information

Amir Sanjar <amir.sanjar@canonical.com>

Upstream Hortonworks Links