Lifting the Fog of Spark Adoption
Clients are often confused about Apache Spark, and this confusion sometimes hinders its adoption. The confusion is not about the features of Spark per se, but about installing and running the big data framework.
One client was convinced that they needed MapR M5 even to use Spark, and they were unclear about how Spark actually runs on a cluster, believing that multiple Spark jobs interacted with one another directly. To illustrate the flexibility of deploying Spark, I walked the client through the following scenario.
First, consider two users. One has been working with Spark SQL via the spark-sql command line for some time; the other wants to use the latest MLlib features and submit Spark jobs to execute on the cluster.
The first user is a business analyst who is knowledgeable in SQL and wants to run sanity checks on the data. Spark is “installed” at /opt/sparks/spark-1.3.1 on the edge node of a Hadoop cluster, and he is happy with the older version of Spark (1.3.1) because he just wants to write SQL and get faster results than Hive on MapReduce would provide. He runs the spark-sql script that resides in the Spark installation’s bin directory, which gives him a command-line interface similar to Hive’s CLI for running SQL queries.
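A session for this first user might look like the following sketch (the path matches the installation above; the table name and query are hypothetical examples):

```shell
# Launch the Spark SQL CLI from the older installation on the edge node
cd /opt/sparks/spark-1.3.1
./bin/spark-sql

# At the spark-sql> prompt, ordinary SQL against Hive tables works, e.g.:
#   spark-sql> SELECT status, COUNT(*) FROM web_logs GROUP BY status;
```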
The second user is a savvy data scientist looking for the newest way to get insights out of the data. This user wants the latest, greatest machine learning library in Spark version 1.6.0 to run fancy statistical models. Again, Spark is “installed,” this time at /opt/sparks/spark-1.6.0 on the edge node. The data scientist writes Scala, compiles her code into jars, and submits them to the cluster, where they run for some time. She submits each job to the YARN cluster using the spark-submit script.
Both users might be reading the same source data. Their two YARN applications run separately and might run under different constraints, depending on how the cluster scheduler is configured. Under the hood, spark-submit transfers the necessary jars to the cluster to run the YARN application.
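The data scientist's submission might look like this sketch (the main class and jar name are hypothetical; the flags are standard spark-submit options):

```shell
# Submit a compiled Scala job to the YARN cluster from the 1.6.0 install.
# com.example.ChurnModel and the assembly jar are placeholder names.
cd /opt/sparks/spark-1.6.0
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.ChurnModel \
  --num-executors 10 \
  --executor-memory 4g \
  target/churn-model-assembly.jar
```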
I used quotes when I wrote that Spark is “installed” because you don’t really install Spark per se when using Spark on YARN. You either: (a) download the binaries from spark.apache.org, (b) download the source code from spark.apache.org and build it, or (c) install your distribution’s pre-packaged (and often outdated) RPM. That said, there is an optional Spark History Server that can be shared across versions (or run separately for each version in use).
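Getting a second version onto the edge node is just a download and unpack. As a sketch (the version and build shown are examples; check spark.apache.org for current links):

```shell
# Download a pre-built binary (the Hadoop 2.6 build of 1.6.0, as an example)
# and unpack it alongside any existing versions under /opt/sparks
wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar -xzf spark-1.6.0-bin-hadoop2.6.tgz
mv spark-1.6.0-bin-hadoop2.6 /opt/sparks/spark-1.6.0
```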
Spark Deployment Options
There are options for deploying Spark. For many enterprise use cases, Spark on YARN makes the most sense because it will run in harmony with other YARN applications on the Hadoop cluster (e.g. MapReduce) while leveraging the scheduling and management capabilities of YARN.
Specifically, Spark can be deployed on a local machine, on a Mesos cluster, on a dedicated Spark cluster (with the Spark master and workers installed on the data nodes), or on a Hadoop cluster using YARN.
If using YARN, Spark can run in client mode, where the driver runs in a client-side JVM on the edge node, or in cluster mode, where the driver runs inside the YARN ApplicationMaster container. With the YARN approach, as illustrated in the previous section, different users can use different Spark versions.
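The difference between the two modes is just a flag on spark-submit (the application class and jar here are placeholders):

```shell
# Client mode: the driver runs in this JVM on the edge node,
# handy for interactive use and seeing driver logs locally
./bin/spark-submit --master yarn --deploy-mode client \
  --class com.example.App app.jar

# Cluster mode: the driver runs inside the YARN ApplicationMaster
# container, so the job survives the edge-node session ending
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class com.example.App app.jar
```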
Spark Runtime Options
Spark also has many runtime options, from interactive command-line interfaces to batch jobs that run user classes packaged in a jar.
Among the options are:
- Interactive shells (CLI)
- Jobs via spark-submit (Scala, Java, or Python)
- The SparkR library (for R users)
- Thrift-based JDBC Server
- Emerging: Hive & Pig “on Spark”
Getting Started with Spark
Hadoop distributions bundle Spark nowadays, but the bundled Spark is often outdated or tailored for use in a dedicated Spark cluster rather than a Hadoop-YARN cluster. Therefore, the best way to get Spark is to download the pre-built binaries from spark.apache.org or to download the source code and compile it yourself.
If you choose the pre-built binaries, there is still a little configuration to do. First, copy hive-site.xml to Spark’s conf directory (for Hive support). Second, edit Spark’s log4j properties (you can copy the template file) and reduce logging from INFO to WARN; otherwise the command-line interface will be very noisy.
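Assuming the layout from the earlier scenario, that setup might look like this (the Hive config path varies by distribution):

```shell
cd /opt/sparks/spark-1.6.0

# Enable Hive support by giving Spark the cluster's Hive configuration
cp /etc/hive/conf/hive-site.xml conf/

# Create log4j.properties from the bundled template and quiet the console
cp conf/log4j.properties.template conf/log4j.properties
sed -i 's/log4j.rootCategory=INFO/log4j.rootCategory=WARN/' conf/log4j.properties
```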
We recommend having a parent “sparks” directory under which the various Spark versions reside. Production systems that use Spark in an automated fashion will likely lag behind in Spark version, but interactive users can use the latest-greatest version if they wish.
Spark It Up
The moral of the story is that you’re not stuck with the Hadoop distribution-supplied Spark package, and you can use the latest-greatest if you wish (or set your own adoption schedule). There’s little to stop you from Sparking It Up now. Using Spark on YARN lets you leverage the resource management and scheduling features of YARN with little effort. No “installation” required!
About the author: Craig Lukasik is senior solution architect at Zaloni. Craig is highly experienced in the strategy, planning, analysis, architecture, design, deployment
and operations of business solutions and infrastructure services. He has a wide range of solid, practical experience delivering solutions spanning a variety of business and technology domains, from high-speed derivatives trading to discovery bioinformatics. Craig is passionate about process improvement (a Lean Sigma Green Belt and MBA) and is experienced with Agile (Kanban and Scrum). Craig enjoys writing and has authored and edited articles and technical documentation. When he’s not doing data work, he enjoys spending time with his family, reading, cooking vegetarian food and training for the occasional marathon.