How I build, deploy, and run Spark
I’m currently tinkering with Spark for my side project JourneyMonitor. The goal is to extract useful metrics from the Selenium runs executed by the platform.
To do so, I’m building a new Analyze component. I want to build the Spark setup and the jobs using Scala 2.11, so I had to compile my own version of Spark 1.5.1, put it onto the systems, and run a cluster from it. This post describes what worked for me.
As of this writing, there is no pre-built version of Spark 1.5.1 for Scala 2.11. Therefore, I downloaded the Source Code package from the official Spark download page and unpacked it, which resulted in folder spark-1.5.1.
The version of Java installed on my Mac OS X 10.11 (El Capitan) system is the official Oracle Standard Edition Java version 1.8.0_25-b17. I had to manually set the JAVA_HOME environment variable in my bash session before building.
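One common way to set JAVA_HOME on OS X is the /usr/libexec/java_home helper. The following is a minimal sketch under that assumption, not necessarily the exact command I used; the variable names java_home_dir and java_home_report are made up for illustration.

```shell
# Assumption: /usr/libexec/java_home is the JDK locator that ships with OS X.
# On systems without it, this block leaves JAVA_HOME untouched.
if [ -x /usr/libexec/java_home ]; then
  # -v 1.8 asks for the home directory of an installed Java 8 JDK.
  if java_home_dir="$(/usr/libexec/java_home -v 1.8 2>/dev/null)"; then
    export JAVA_HOME="$java_home_dir"
  fi
fi

# Report what a subsequent Maven build would see.
java_home_report="JAVA_HOME=${JAVA_HOME:-<not set>}"
echo "$java_home_report"
```

The export only lasts for the current bash session, which matches the manual approach described above.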
Spark builds with Maven, so I installed it using Homebrew:
brew install maven
Afterwards, I changed into the spark-1.5.1 folder and ran
./dev/change-scala-version.sh 2.11
I was then able to start building my own Spark 1.5.1 package for Scala 2.11 by running:
./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests
This took a while and resulted in file spark-1.5.1-bin-hadoop-2.6_scala-2.11.tgz in the spark-1.5.1 root folder. It contains a fully self-contained installation of Spark for Scala 2.11 including helper scripts and such – I’ve made the archive available at the JourneyMonitor infra-artifacts repository on Bintray. Installing Spark now was simply a matter of extracting the archive to the desired target folder.
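As a rough sketch of the extraction step: the helper below unpacks the archive into a target folder and fails early if the archive is missing. The function name install_spark and the example target folder are hypothetical; only the archive name comes from the build above.

```shell
# Hypothetical helper: unpack a Spark distribution archive into a target folder.
install_spark() {
  archive="$1"
  target_dir="$2"

  # Fail early with a readable message instead of a tar error.
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi

  # Create the target folder if needed, then unpack into it.
  mkdir -p "$target_dir" && tar -xzf "$archive" -C "$target_dir"
}

# Example call (the target folder is an assumption, pick your own):
# install_spark spark-1.5.1-bin-hadoop-2.6_scala-2.11.tgz "$HOME/opt"
```

The archive unpacks into a single top-level folder, so the Spark root ends up one level below the target folder.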
From this folder, it’s now simple to start up a local cluster with a master process and one worker (or slave) process:
./sbin/start-master.sh --host 127.0.0.1
./sbin/start-slave.sh spark://127.0.0.1:7077
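If you start the local cluster often, the two commands can be wrapped in a small function. This is only a sketch: the function name start_local_cluster and its parameters are my own invention, and it does nothing beyond calling the two scripts shown above in order.

```shell
# Hypothetical wrapper around the two start scripts shown above.
# $1: folder the Spark archive was extracted to
# $2: host to bind to (optional, defaults to 127.0.0.1)
start_local_cluster() {
  spark_home="$1"
  host="${2:-127.0.0.1}"

  # Start the master first so the worker has something to register with.
  "$spark_home/sbin/start-master.sh" --host "$host" || return 1

  # Point the worker at the master URL; 7077 is the port used above.
  "$spark_home/sbin/start-slave.sh" "spark://$host:7077"
}

# Example call (the path is an assumption):
# start_local_cluster "$HOME/opt/spark-1.5.1-bin-hadoop-2.6_scala-2.11"
```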
In order to test-run a Spark application on this cluster, I created a simple Spark app as described in the Spark Quick Start guide, but with two changes: in build.sbt, I changed scalaVersion to 2.11.7 and added % "provided" to the end of the libraryDependencies line. When a Spark application runs on a Spark cluster, the cluster provides the spark-core dependency, so there is no need to bundle it into the package of the application itself.
You can view the SimpleApp source code on the JourneyMonitor analyze repository on GitHub.
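For reference, here is a sketch of what the resulting build.sbt could look like, written out by a small shell helper. The function name write_build_sbt is hypothetical, and the project name and version are inferred from the jar file name spark-test_2.11-1.0.jar rather than copied from the repository, so treat them as assumptions.

```shell
# Hypothetical helper: write a build.sbt with the two changes described above.
write_build_sbt() {
  project_dir="$1"
  mkdir -p "$project_dir" || return 1

  # Quoted 'EOF' keeps the shell from expanding anything inside the file body.
  cat > "$project_dir/build.sbt" <<'EOF'
// name and version are assumptions inferred from the jar file name
name := "spark-test"

version := "1.0"

scalaVersion := "2.11.7"

// "provided": the cluster supplies spark-core at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
EOF
}

# Example call (the folder name is an assumption):
# write_build_sbt ./spark-test
```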
To create the SimpleApp jar, run sbt package in the root folder of the app project. This results in the file target/scala-2.11/spark-test_2.11-1.0.jar, which can now be run on the cluster:
./bin/spark-submit --deploy-mode cluster --master spark://127.0.0.1:6066 PATH/TO/SIMPLEAPP/target/scala-2.11/spark-test_2.11-1.0.jar
Because the app is run on the cluster, its output is not shown on the command line. Instead, you have to visit the web UI of the worker node at http://localhost:8081/. In the Finished Drivers section you’ll find an entry for each job run that has been submitted, and the stdout link for a job run shows the output of the app.
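Since a failed submission is easy to miss when the output only shows up in the web UI, it can help to check the jar path before submitting. The sketch below wraps the spark-submit call from above. The function name submit_app is hypothetical, and resolving the jar to an absolute path is my own precaution, on the assumption that in cluster mode the driver is started by a worker process rather than from your current directory.

```shell
# Hypothetical wrapper around the spark-submit call shown above.
# $1: folder the Spark archive was extracted to
# $2: path to the application jar
submit_app() {
  spark_home="$1"
  jar="$2"

  # Catch a wrong jar path here instead of hunting for it in the web UI.
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi

  # Resolve to an absolute path (a precaution, see the note above).
  jar_abs="$(cd "$(dirname "$jar")" && pwd)/$(basename "$jar")"

  "$spark_home/bin/spark-submit" --deploy-mode cluster \
    --master spark://127.0.0.1:6066 "$jar_abs"
}

# Example call (both paths are assumptions):
# submit_app "$HOME/opt/spark-1.5.1-bin-hadoop-2.6_scala-2.11" \
#   target/scala-2.11/spark-test_2.11-1.0.jar
```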
In order to deploy the Spark package to the JourneyMonitor systems and run master and worker instances, I’m currently building a Puppet module which is available as part of the JourneyMonitor infra repository on GitHub.
I would like to make clear that I’m still very much a beginner in this area, so don’t take this guide as a “best practice” approach. I would love to hear feedback from folks with more experience in the comments section.
For several months now, a small Spark cluster has been up and running at the JourneyMonitor project. I currently can’t spare the time for a write-up, but have a look at the Puppet manifests for the Spark Master and for the Spark Slaves for some inspiration on how to get the Spark package that was built in the course of this post up and running.