What is Tez

Apache Tez: a framework for YARN-based data processing applications in Hadoop.

Apache™ Tez is an extensible framework for building YARN-based, high-performance
batch and interactive data processing applications in Hadoop that need to handle
TB- to PB-scale datasets. It allows projects in the Hadoop ecosystem, such as
Apache Hive and Apache Pig, as well as third-party software vendors, to express
fit-to-purpose data processing applications in a way that meets their unique
demands for fast response times and extreme throughput at petabyte scale.

Why Apache Tez
Apache Tez provides a developer API and framework to write native YARN
applications that bridge the spectrum of interactive and batch workloads.
It allows applications to seamlessly span the scalability dimension from
GBs to PBs of data and from tens to thousands of nodes. The Apache Tez
component library allows developers to create Hadoop applications that
integrate with YARN and perform well within mixed-workload Hadoop clusters.

Because Tez is extensible and embeddable, it provides the fit-to-purpose
freedom to express highly optimized data processing applications, giving
them an advantage over general-purpose, end-user-facing engines such as
MapReduce and Spark. It also offers a customizable execution architecture
that lets you express complex computations as dataflow graphs and supports
dynamic performance optimizations based on real information about the data
and the resources required to process it.

Tez use case

Verify that your cluster meets the following prerequisites before installing Tez:
Apache Hadoop 2.4.x and YARN
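
The prerequisite check can be scripted. The sketch below assumes that the first line of `hadoop version` has the form `Hadoop <version>`; `is_supported_hadoop` is a hypothetical helper name and is not part of Hadoop or Juju.

```shell
#!/usr/bin/env bash
# Sketch: test a Hadoop version string against the 2.4.x prerequisite.
# is_supported_hadoop is an illustrative helper, not a Hadoop or Juju command.

is_supported_hadoop() {
  case "$1" in
    2.4|2.4.*) return 0 ;;   # 2.4 or any 2.4.x release
    *)         return 1 ;;   # anything else is treated as unsupported
  esac
}

# Possible usage on a cluster node (assumes the "Hadoop <version>" first line):
#   version="$(hadoop version | head -n 1 | awk '{print $2}')"
#   if is_supported_hadoop "$version"; then echo "ok"; else echo "unsupported: $version"; fi
```

The pattern matches `2.4.` followed by anything, so vendor-suffixed version strings are accepted while a release such as 2.40.0 is not.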

To deploy a four-node Hadoop cluster:

juju deploy hdp-hadoop yarn-hdfs-master
juju deploy hdp-hadoop compute-node
juju add-unit -n 2 compute-node
juju add-relation yarn-hdfs-master:namenode compute-node:datanode
juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager

To deploy a Tez client:

juju deploy hdp-tez hdp-tez
juju add-relation hdp-tez:resourcemanager yarn-hdfs-master:resourcemanager
juju add-relation hdp-tez:namenode yarn-hdfs-master:namenode
juju add-relation hdp-tez:hadoop-nodes compute-node:hadoop-nodes
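
The cluster and Tez client commands above can be collected into one script. This is a minimal sketch: `DRY_RUN`, `run`, `deploy_cluster`, and `deploy_tez_client` are illustrative names, and the script defaults to printing the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch: the deployment commands from this README in one script.
# DRY_RUN, run, deploy_cluster and deploy_tez_client are illustrative names.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Four-node Hadoop cluster: one master plus three compute nodes.
deploy_cluster() {
  run juju deploy hdp-hadoop yarn-hdfs-master &&
  run juju deploy hdp-hadoop compute-node &&
  run juju add-unit -n 2 compute-node &&
  run juju add-relation yarn-hdfs-master:namenode compute-node:datanode &&
  run juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager
}

# Tez client related to the cluster deployed above.
deploy_tez_client() {
  run juju deploy hdp-tez hdp-tez &&
  run juju add-relation hdp-tez:resourcemanager yarn-hdfs-master:resourcemanager &&
  run juju add-relation hdp-tez:namenode yarn-hdfs-master:namenode &&
  run juju add-relation hdp-tez:hadoop-nodes compute-node:hadoop-nodes
}
```

Calling `deploy_cluster` and then `deploy_tez_client` prints the nine commands; setting `DRY_RUN=0` beforehand runs them against the current Juju environment.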

Tez scale

To scale out, add compute nodes:

juju add-unit -n 2 compute-node

Verify deployment

Execute:

$ juju run "sudo su hdfs -c 'hdfs dfs -ls /apps/tez'" --unit hdp-tez/0

A successful result:

 hdfs users   ...  /apps/tez/conf
 hdfs users   ...  /apps/tez/lib
 hdfs users   ...  /apps/tez/tez-api-0.4.0.2.1.3.0-563.jar
 hdfs users   ...  /apps/tez/tez-common-0.4.0.2.1.3.0-563.jar
 hdfs users   ...  /apps/tez/tez-dag-0.4.0.2.1.3.0-563.jar
 hdfs users   ...  /apps/tez/tez-mapreduce-0.4.0.2.1.3.0-563.jar
 hdfs users   ...  /apps/tez/tez-mapreduce-examples-0.4.0.2.1.3.0-563.jar
 hdfs users   ...  /apps/tez/tez-runtime-internals-0.4.0.2.1.3.0-563.jar
 hdfs users   ...  /apps/tez/tez-runtime-library-0.4.0.2.1.3.0-563.jar
 hdfs users   ...  /apps/tez/tez-tests-0.4.0.2.1.3.0-563.jar
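
The listing can also be checked mechanically. The sketch below reads the listing from standard input and reports any core Tez jar that is absent; `check_tez_listing` is a hypothetical helper, and the set of jars it treats as required is an assumption taken from the sample output above.

```shell
#!/usr/bin/env bash
# Sketch: verify that a listing of /apps/tez contains the core Tez jars.
# check_tez_listing is an illustrative helper; the "required" jar list is an
# assumption based on the sample output in this README.

check_tez_listing() {
  local listing name missing=0
  listing="$(cat)"   # read the whole listing from stdin
  for name in tez-api tez-common tez-dag tez-mapreduce \
              tez-runtime-internals tez-runtime-library; do
    # Require "<name>-<digit>...jar" so tez-mapreduce does not match
    # tez-mapreduce-examples.
    if ! printf '%s\n' "$listing" |
         grep -Eq "/apps/tez/${name}"'-[0-9][^/]*\.jar$'; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible usage:
#   juju run "sudo su hdfs -c 'hdfs dfs -ls /apps/tez'" --unit hdp-tez/0 | check_tez_listing
```

The function returns 0 only when every listed jar is present, so it can gate later steps in a script.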

HDFS validation from Tez client

1) Check the health of the remote HDFS cluster

$ juju run "su hdfs -c 'hdfs dfsadmin -report'" --unit hdp-tez/0

Validate the returned information.

2) Create a directory on the HDFS cluster and confirm that it succeeds

$ juju run "su hdfs -c 'hdfs dfs -mkdir /tmp'" --unit hdp-tez/0

3) Copy a test data file to the HDFS cluster

$ juju run "su hdfs -c 'hdfs dfs -put /home/ubuntu/pg4300.txt /tmp'" --unit hdp-tez/0

4) Run the Tez word-count example

$ juju run "/home/ubuntu/runtez_wc.sh" --unit hdp-tez/0

5) View the result saved on the HDFS cluster

$ juju run "su hdfs -c 'hdfs dfs -cat /tmp/pg4300.out/*'" --unit hdp-tez/0
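
To sanity-check the word-count result, a local count over the same input file can serve as a rough reference. This is a sketch only: `local_wordcount` is a hypothetical helper, it splits on whitespace, and it prints `word<TAB>count` lines. The Tez example may tokenize or format its output differently, so expect to compare a few individual words rather than whole files.

```shell
#!/usr/bin/env bash
# Sketch: whitespace-delimited word count for cross-checking a few entries.
# local_wordcount is an illustrative helper; the "word<TAB>count" output
# format is an assumption, not the documented format of the Tez example.

local_wordcount() {
  LC_ALL=C tr -s '[:space:]' '\n' |   # one word per line
    grep -v '^$' |                    # drop empty lines
    LC_ALL=C sort |                   # group identical words together
    uniq -c |                         # count each group
    awk '{print $2 "\t" $1}'          # reorder to word<TAB>count
}

# Possible usage on the Tez client unit:
#   local_wordcount < /home/ubuntu/pg4300.txt | head
```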

Tez Contact Information

Amir Sanjar amir.sanjar@canonical.com

Upstream project: Hortonworks Tez