5 machines, 5 units
Apache Spark™ is a fast and general purpose engine for large-scale data processing.
The IPython Notebook is an interactive computational environment, in which you
can combine code execution, rich text, mathematics, plots and rich media.
Speed:
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing.
Ease of Use:
Write applications quickly in Java, Scala or Python.
Spark offers over 80 high-level operators that make it easy to build parallel apps.
And you can use it interactively from the Scala and Python shells.
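As a rough illustration of how a few of those operators chain together, the sketch below counts words with flatMap, map, and reduceByKey. The names tokenize, count_words_locally, and count_words_with_spark are made up for this example and are not part of the bundle. It assumes PySpark is importable where the Spark variant runs; that import is deferred so the plain-Python reference can be tried on any machine.

```python
from collections import Counter


def tokenize(line):
    """Split a line into lowercase words (hypothetical helper)."""
    return line.lower().split()


def count_words_locally(lines):
    """Plain-Python reference for the result the Spark pipeline should give."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def count_words_with_spark(lines, app_name="WordCountSketch"):
    """Same computation expressed with Spark operators.

    Assumption: PySpark is installed. It is imported here, not at module
    level, so the rest of this file works without it.
    """
    from pyspark import SparkContext

    sc = SparkContext(appName=app_name)
    try:
        pairs = (
            sc.parallelize(lines)               # distribute the input lines
            .flatMap(tokenize)                  # one record per word
            .map(lambda word: (word, 1))        # pair each word with a count
            .reduceByKey(lambda a, b: a + b)    # sum the counts per word
        )
        return dict(pairs.collect())
    finally:
        sc.stop()


if __name__ == "__main__":
    sample = ["spark is fast", "spark is general purpose"]
    print(count_words_locally(sample))
```

Calling count_words_with_spark(sample) on a node with PySpark should return the same dictionary as count_words_locally(sample), which makes the local version a handy check.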
General Purpose Engine:
Combine SQL, streaming, and complex analytics.
Spark powers a stack of high-level tools including Shark for SQL, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these frameworks
seamlessly in the same application.
To deploy, run the following from the bundle's home directory:
juju quickstart bundles.yaml
To increase the number of Spark slaves, add units to the spark-slave service.
For example, to add four more units (the current bundle has 4 spark-slave units):
juju add-unit -n4 spark-slave
# Spark admins use ssh to access the Spark console from the master node
1) juju ssh spark-master/0 <== ssh to the Spark master
2) Use spark-submit to run your application:
spark-submit --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples*.jar 10
You should get a result close to pi = 3.14.
Alternatively, execute demo.sh from /home/ubuntu.
3) Spark’s shell provides a simple way to learn the API, as well as a powerful
tool to analyze data interactively. It is available in either Scala or Python.
Start it by running the following in the Spark directory:
$ spark-shell <== for interactive use with Scala
$ pyspark <== for interactive use with Python
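To give a feel for what the SparkPi run in step 2 computes, here is a minimal Python sketch of the same Monte Carlo idea: sample random points in a square and count how many land inside the unit circle. It is not the bundled example's source. All function names are hypothetical, and estimate_pi_with_spark assumes it is handed the sc SparkContext that the pyspark shell creates on startup.

```python
import random


def is_inside_unit_circle(x, y):
    """True when the point (x, y) lies within the circle of radius 1."""
    return x * x + y * y <= 1.0


def sample_hit(_):
    """Draw one random point in [-1, 1] x [-1, 1]; return 1 on a hit, else 0."""
    x = random.uniform(-1.0, 1.0)
    y = random.uniform(-1.0, 1.0)
    return 1 if is_inside_unit_circle(x, y) else 0


def pi_from_hits(hits, samples):
    """The circle covers pi/4 of the square, so pi is about 4 * hits / samples."""
    return 4.0 * hits / samples


def estimate_pi_with_spark(sc, samples=100000, slices=10):
    """Spread the sampling over `slices` partitions.

    Assumption: `sc` is an active SparkContext, such as the one the pyspark
    shell provides.
    """
    hits = (
        sc.parallelize(range(samples), slices)
        .map(sample_hit)
        .reduce(lambda a, b: a + b)
    )
    return pi_from_hits(hits, samples)


def estimate_pi_locally(samples=100000):
    """Same estimate without Spark, useful for checking the logic."""
    hits = sum(sample_hit(i) for i in range(samples))
    return pi_from_hits(hits, samples)


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi_locally())
```

After pasting the definitions into the pyspark shell, calling estimate_pi_with_spark(sc) should return a value near 3.14. The exact digits vary from run to run because the samples are random.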