Apache Spark™ is a fast and general purpose engine for large-scale data processing.
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
Ease of Use:
Write applications quickly in Java, Scala or Python.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.
General Purpose Engine:
Combine SQL, streaming, and complex analytics.
Spark powers a stack of high-level tools including Shark for SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.
Integrated with other Cluster Managers: YARN, EC2, Mesos
Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data.
If you have a Hadoop 2 cluster, you can run Spark without any installation needed. Otherwise, Spark is easy to run standalone or on EC2 or Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
For more details, see http://spark.apache.org/
Spark Standalone cluster deployment
Deploy one Spark master node and one or more Spark slave nodes:
juju deploy spark spark-master
juju deploy spark spark-slave
juju add-relation spark-master:master spark-slave:slave
Smoke tests after deployment
Spark admins use ssh to access the Spark console from the master node.

1) ssh to the Spark master:

    juju ssh spark-master/0

2) Use spark-submit to run your application:

    spark-submit --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar

   You should get pi = 3.14.

3) Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala or Python. Start it by running one of the following in the Spark directory:

    $ spark-shell    <== for interaction using Scala
    $ pyspark        <== for interaction using Python
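The SparkPi smoke test above can also be checked automatically rather than by eye. The sketch below assumes the example prints a line of the form "Pi is roughly 3.14..." on standard output; `check_pi_output` is a hypothetical helper name, not part of Spark or of this charm.

```shell
# Hypothetical helper: succeeds if stdin contains a SparkPi result line
# whose value starts with 3.1 (for example "Pi is roughly 3.14158").
check_pi_output() {
    grep -Eq '^Pi is roughly 3\.1'
}

# Possible use on the master node (shown as a comment, not executed here):
#   spark-submit --class org.apache.spark.examples.SparkPi \
#       /usr/lib/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar 2>/dev/null \
#       | check_pi_output && echo "smoke test passed"
```

The check is deliberately loose: SparkPi uses a random estimate, so only the leading digits are stable between runs.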
From Master node only
Select one of the following Spark cluster management operations; type "none" for no operation:
none, spark-submit, start-master, start-slaves, start-all, stop-master, stop-slaves, stop-all.
* none - do nothing
* spark-submit - Submits a Spark job to the Spark cluster. Requires "spark_job_class" and "application_jar" to be set.
* start-master - Starts master on master node.
* start-slaves - Starts a slave instance on each slave node.
* start-all - Starts both master and all slaves.
* stop-master - Stops the master.
* stop-slaves - Stops all slave instances on the nodes registered to master node.
* stop-all - Stops both the master and the slave nodes.
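The start and stop operations listed above share their names with the scripts in Spark's sbin directory. As a rough sketch of how an operation name could be resolved to a script, assuming Spark is installed under /usr/lib/spark (inferred from the example jar path, not stated by the charm), a hypothetical `operation_script` helper might look like this:

```shell
# Assumed install prefix; override by exporting SPARK_HOME.
SPARK_HOME="${SPARK_HOME:-/usr/lib/spark}"

# Hypothetical helper: print the script that backs an operation name.
# "none" prints nothing and succeeds; unknown names return 1.
operation_script() {
    case "$1" in
        none)
            ;;
        spark-submit)
            printf '%s/bin/spark-submit\n' "$SPARK_HOME"
            ;;
        start-master|start-slaves|start-all|stop-master|stop-slaves|stop-all)
            printf '%s/sbin/%s.sh\n' "$SPARK_HOME" "$1"
            ;;
        *)
            echo "unknown operation: $1" >&2
            return 1
            ;;
    esac
}
```

The helper only prints a path; whether the charm invokes these scripts in exactly this way is not confirmed by this document.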
Amount of memory to use per executor process, in the same format as
JVM memory strings (e.g. 512m, 2g). The default is 512m.
Number of cores available to executor process
The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes (e.g. /usr/lib/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar).
Arguments passed to the main method of your main class, if any
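To show how the job class, application jar, executor memory, and application arguments described above fit together, here is a minimal sketch that assembles a spark-submit command line and prints it without running it. `build_submit_cmd` is a hypothetical helper; `--class` and `--executor-memory` are standard spark-submit options, but how the charm itself builds the command is an assumption.

```shell
# Hypothetical helper: print a spark-submit command line (dry run only).
# Usage: build_submit_cmd <job class> <application jar> <executor memory> [app args...]
build_submit_cmd() {
    [ "$#" -ge 3 ] || { echo "usage: build_submit_cmd CLASS JAR MEMORY [ARGS...]" >&2; return 1; }
    job_class="$1"; app_jar="$2"; exec_mem="$3"
    shift 3
    # Options go before the jar; anything after the jar is passed to main().
    printf 'spark-submit --class %s --executor-memory %s %s' "$job_class" "$exec_mem" "$app_jar"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

build_submit_cmd org.apache.spark.examples.SparkPi \
    /usr/lib/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar 512m 10
# prints: spark-submit --class org.apache.spark.examples.SparkPi --executor-memory 512m /usr/lib/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar 10
```

The printed line is meant for inspection; arguments containing spaces would need quoting before the line could be pasted into a shell.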
spark_serializer (currently disabled)
Spark can also use the Kryo library (version 2) to serialize objects more quickly
amir sanjar firstname.lastname@example.org
Upstream Project Name
Logging levels: ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF
- WARN, console
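The default value "WARN, console" has the shape of a log4j root logger setting. Assuming it corresponds to the log4j.rootCategory line in a log4j.properties file (this document does not say which file the charm edits), a hypothetical `set_root_log_level` helper could validate the level and rewrite that line:

```shell
# Hypothetical helper: set the log4j root level in a properties file.
# Usage: set_root_log_level <LEVEL> <path to log4j.properties>
set_root_log_level() {
    level="$1"; file="$2"
    case "$level" in
        ALL|DEBUG|INFO|WARN|ERROR|FATAL|OFF) ;;
        *) echo "unknown log level: $level" >&2; return 1 ;;
    esac
    # Rewrite only the rootCategory line, keeping the console appender name.
    sed "s/^log4j\.rootCategory=.*/log4j.rootCategory=$level, console/" "$file" > "$file.tmp" \
        && mv "$file.tmp" "$file"
}
```

For example, `set_root_log_level WARN conf/log4j.properties` would leave every other line of the file untouched.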
Number of cores available to executor process
Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
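A malformed memory value is an easy mistake to make, so it can help to check the string before passing it on. The sketch below accepts a positive integer followed by a k, m, or g suffix, which covers the examples given here (512m, 2g); Spark itself may accept a wider set of forms, and `is_memory_string` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the argument looks like a JVM-style
# memory string such as 512m or 2g (digits plus a k/m/g suffix).
is_memory_string() {
    printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]*[kKmMgG]$'
}

# Example use:
#   is_memory_string "$executor_memory" || echo "invalid memory value" >&2
```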
Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
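In Spark itself, Kryo is selected through the spark.serializer property. Since this option is marked as currently disabled in the charm, the following is only a sketch of how the setting could be applied by hand: the hypothetical `enable_kryo` helper appends the property to a Spark defaults file (the file path is passed in by the caller and is an assumption) unless a serializer is already configured.

```shell
# Hypothetical helper: add the Kryo serializer setting to a Spark
# defaults file, unless a spark.serializer line is already present.
# Usage: enable_kryo <path to spark-defaults.conf>
enable_kryo() {
    conf_file="$1"
    if grep -q '^spark\.serializer' "$conf_file" 2>/dev/null; then
        return 0
    fi
    printf 'spark.serializer org.apache.spark.serializer.KryoSerializer\n' >> "$conf_file"
}
```

Calling the helper twice leaves a single line in the file. Registering application classes with Kryo, as recommended above for best performance, is done in application code and is not covered by this sketch.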