Overcoming Spark Performance Challenges in Enterprise Hadoop Environments
Interest in Apache Spark is ballooning as word spreads about the real advantages it brings to the world of big data analytics. But as with most new technologies, adopting Spark is not always smooth sailing, particularly if you’re running Spark jobs on a production Hadoop cluster. Here are some ways big data engineers are overcoming those challenges.
You don’t have to run Spark on Hadoop. It also runs on the Mesos resource scheduler from AMPLab, the Apache Cassandra NoSQL database, and the in-memory relational database from MemSQL. You can also access it as a standalone cloud service from Databricks, the company behind Spark. But considering the speed and simplicity advantages that Spark provides over MapReduce, it’s easy to see why so many big data coders and software vendors targeting Hadoop are shifting to Spark. In fact, they’re doing it in droves.
But as Spark spreads into the real world, it’s hitting the occasional speed bump, which is to be expected. Frankly, the pace at which Spark is being adopted is just incredible, and that says a lot about what Spark does and how successfully it fills a need for organizations that are operationalizing their data science insights.
Chris McKinlay, senior data scientist at the Los Angeles-based data science consultancy Data Science, uses Spark extensively in his engagements. As a key component of the so-called “SMACK” (Spark, Mesos, Akka, Cassandra, and Kafka) stack, Spark plays a major role in executing the big data business plans of startups that Data Science works with around the region.
“We went through a growing phase where we just started coding more Scala and Spark and it just totally changed our organization,” McKinlay tells Datanami. “With platforms like Spark in particular, the code that you write to do small exploratory data analysis projects can be immediately scaled right up, oftentimes with the exact same code, line for line. That’s mainly true if coding in Scala.”
Hadoop’s place in the SMACK stack is up for debate. Folks like McKinlay say HDFS is just assumed to underlie everything, while others, like Patrick McFadin, chief evangelist for Apache Cassandra at Cassandra-backer DataStax, say Hadoop “fits into the slow data space.”
In any event, Spark and Hadoop have a long life ahead of them, and it behooves them to get along. The problem is, they don’t always play well together, thanks to the in-memory nature of Spark and its reputation as being a resource hog in some circumstances.
One way to respond to Spark resource problems on Hadoop clusters is to add more hardware. Spark is primarily an in-memory framework, so adding more RAM will give you an immediate performance boost. Alternatively, you could get more memory by adding more nodes and spreading out the data and processing even more.
You could also use Hadoop node labels to identify individual nodes in the cluster that have different hardware characteristics, such as extra CPU or extra memory. This feature, which was added with Apache Hadoop version 2.6 in late 2014, gives the Hadoop user some additional control over which applications run on which nodes.
Companies can also run Spark on their own isolated clusters, whether on-premise or in the cloud. This ensures that the Spark jobs always have full access to all of the hardware resources in that cluster, and that they will not take resources away from other jobs. Of course, you’ll need to move the data from your warehouse or lake, which will often be a Hadoop cluster. That cuts into some of the benefits.
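As a rough illustration of how node labels might be combined with Spark on YARN, the sketch below assembles a spark-submit command that asks YARN to place executors only on labeled nodes. The label name `highmem`, the memory and executor figures, and the application file `job.py` are hypothetical placeholders. The `spark.yarn.executor.nodeLabelExpression` property is a Spark-on-YARN setting, and it only has an effect if a cluster administrator has already defined the label and assigned it to nodes.

```python
def build_spark_submit(app, label, executor_memory, num_executors):
    """Return a spark-submit argument list that requests labeled YARN nodes."""
    return [
        "spark-submit",
        "--master", "yarn",
        # More RAM per executor is the simplest lever for an in-memory engine.
        "--executor-memory", executor_memory,
        "--num-executors", str(num_executors),
        # Ask YARN to schedule executors only on nodes carrying this label.
        "--conf", f"spark.yarn.executor.nodeLabelExpression={label}",
        app,
    ]


# Hypothetical values: label, memory size, executor count, and script name.
cmd = build_spark_submit("job.py", "highmem", "16g", 8)
print(" ".join(cmd))
```

The function only builds the command; running it would require a configured YARN cluster, so treat the values as a starting point rather than a recommendation.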
Big data vendors have started to offer their own solutions to the problem of resource contention in Spark environments. Last month IBM debuted a new offering called the IBM Platform Conductor for Apache Spark that aims to give customers more control over their Spark environment.
The offering, which is part of IBM’s “hyperscale convergence” strategy, provides customers with finer-grained resource scheduling than what they get through Mesos or YARN. It also allows users to run multiple Spark instances and different Spark versions, the company says, and it provides security separation between the various Spark instances, lifecycle management for Spark projects, and an option to use HDFS.
That doesn’t provide Spark-on-Hadoop users much help. One possible solution comes from Sean Suchter, the CEO and founder of Pepperdata, who has a lot of experience running Hadoop clusters. (In fact, he ran the first Hadoop cluster when it was turned on 10 years ago at Yahoo.)
Pepperdata’s software helps optimize Hadoop clusters by enforcing service level agreements (SLAs) and making sure all the different engines play nicely together. More than one-third of Pepperdata’s clients are currently using Spark, which jibes with overall industry trends.
According to Suchter, the Pepperdata software can control Spark jobs in the same manner that it plays traffic cop to other Hadoop workloads, such as MapReduce, HBase, Hive, and Impala. One customer saw a 90 percent gain in performance after using Pepperdata’s software to tune the cluster, Suchter says.
Spark In Action
One large telecommunications analytics firm in particular succinctly demonstrated the need for resource management when Spark jobs go into production. The company (which Suchter declined to identify) had adopted Spark in a big way and was using Spark in three critical ways on its various Hadoop clusters, the largest of which is 800 nodes.
The first way the telecom company used it was gathering data from the field (from cell towers and network segments) to identify problems in the network. The company used Spark Streaming to parse the data as it comes in, and this job has a very tight SLA, measured in seconds.
The company was also using Spark jobs to run complex ad hoc analytics queries upon daily rollups of data, which were assembled using long-running MapReduce jobs. The ad hoc queries had SLAs measured in minutes. The third area involved the queries used to populate the dashboards that appear in front of the customer service representatives (CSRs) working in call centers. These Spark jobs have latencies measured in milliseconds, since the company doesn’t want customers waiting on the phone for information, Suchter says.
Balancing those different SLAs can be tricky in Spark, Suchter says. “The problem is without any kind of active data management, your desired stack of priorities might be exactly inverted in terms of what actually happens,” he says. “You might be humming along fine in terms of customer service queries and your streaming ingest, but if somebody comes in with some serious ad hoc workload right then, or if the data enrichment hits a snag and needs more processing, those things can overwhelm the cluster.”
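The priority inversion Suchter describes can be shown with a toy model. The sketch below is not how Pepperdata or YARN allocates resources; it simply hands out a fixed pool of cores in list order, first in arrival order and then in SLA order. The workload names, the 100-core cluster, and the demand figures are all hypothetical.

```python
def allocate(capacity, requests):
    """Grant each (name, demand) request in list order until capacity runs out."""
    grants = {}
    for name, demand in requests:
        granted = min(demand, capacity)
        grants[name] = granted
        capacity -= granted
    return grants


# Hypothetical core demands, listed in arrival order: the big ad hoc job lands first.
arrivals = [("ad_hoc", 90), ("streaming", 20), ("csr_dashboard", 10)]

# Desired priority, tightest SLA first: milliseconds, then seconds, then minutes.
priority = {"csr_dashboard": 0, "streaming": 1, "ad_hoc": 2}

# No active management: whoever arrives first gets served first.
first_come = allocate(100, arrivals)

# Active management: serve requests in SLA order instead.
by_sla = allocate(100, sorted(arrivals, key=lambda r: priority[r[0]]))

print(first_come)
print(by_sla)
```

In this toy run, the first-come allocation leaves the call-center dashboard with no cores at all, while the SLA-ordered allocation fully serves the dashboard and streaming jobs and gives the ad hoc job the remaining 70 cores.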
Hadoop as Tetris
Yes, Spark jobs can be unpredictable in how they consume resources, which may give your IT team fits. But the benefits of Spark are too great to ignore, so organizations will need to find a way to adapt to it.
Suchter likens it to a game of Tetris, a strategic shape-fitting game that youth played in the previous millennium before GPUs filled their neocortices with fast-moving images and surround sound. “When you’re playing Tetris, once you start filling in the gaps, if it keeps giving you the same piece over and over again, it’s really hard to successfully plug those gaps,” he says.
“But if you keep getting different pieces, you can pack it really well,” he continues. “So there’s a real advantage to having different engines that are different shaped pieces. The idea of having Spark being a different shaped piece than MapReduce and a different shaped piece than HBase and Impala – that’s really useful to have one more piece to fit into your cluster.”
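The Tetris analogy can be made concrete with a small packing exercise. The sketch below is only an illustration under made-up numbers: each node offers 4 units of CPU and 4 units of memory, and tasks come in two hypothetical shapes, one CPU-heavy and one memory-heavy. It does not claim that any particular engine has either shape, and the first-fit placement is a stand-in for a real scheduler.

```python
def first_fit(node_count, node_capacity, tasks):
    """Place (cpu, mem) tasks on the first node with room; return how many fit."""
    free = [list(node_capacity) for _ in range(node_count)]
    placed = 0
    for cpu, mem in tasks:
        for slot in free:
            if slot[0] >= cpu and slot[1] >= mem:
                slot[0] -= cpu
                slot[1] -= mem
                placed += 1
                break
    return placed


# Hypothetical task shapes, in (cpu, mem) units.
CPU_HEAVY = (3, 1)
MEM_HEAVY = (1, 3)

# Eight tasks of the same shape versus eight tasks of alternating shapes,
# each offered to four nodes with 4 CPU and 4 memory units apiece.
same_shape = first_fit(4, (4, 4), [CPU_HEAVY] * 8)
mixed_shapes = first_fit(4, (4, 4), [CPU_HEAVY, MEM_HEAVY] * 4)

print(same_shape, mixed_shapes)
```

With identical pieces, only four of the eight tasks fit and most of the memory sits idle; with mixed pieces, all eight fit and every node is fully used.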