Apache Spark and Java 8: The Big Data Team for 2015
Apache Spark with Java 8 is proving to be the perfect match for Big Data. Spark 1.0 was just released this May, and it’s already surpassed Hadoop in popularity on the Web. Java 8, the latest version, came out in March and is spreading fast: As of October, a survey from Typesafe showed that two-thirds of developers had switched to Java 8 or were planning to switch soon, faster adoption than for earlier versions.
Ten years ago, Hadoop set the standard for batch-processing unstructured documents. Originally designed for search-engine indexes, it quickly emerged as the leading computing engine for Big Data more generally. But it lacked a lot of essential services, which developers of each application needed to reinvent. Over the years, essential layers like workflow management and direct support for structured data grew on top of Hadoop, and today, dozens of open-source projects give Hadoop the components that it needs. But the full collection of services is a cumbersome mess. It’s so hard to set up that developers often download a pre-configured virtual machine that bundles the open-source modules, resulting in a heavyweight and inflexible system.
Spark offers a new and powerful alternative to Hadoop, starting with the core processing functionality. Spark moves data around with a convenient abstraction called Resilient Distributed Datasets (RDDs), which are transparently pulled into memory on demand from a variety of data stores such as the Hadoop Distributed File System (HDFS), NoSQL databases such as Cassandra, and more. With RDDs, Spark can execute sophisticated streaming and other parallel processing tasks behind the scenes on behalf of application developers. Spark can often run multistep algorithms in memory, which is much more efficient than dumping intermediate steps to a distributed file system as in Hadoop.
But the most important difference that Spark makes to a development team’s work is the convenience of its programming model. It offers a simple programming API with powerful idioms for common data processing tasks that require less coding effort than Hadoop. Even for the most basic processing tasks, Hadoop requires several Java classes and repetitive boilerplate code to carry out each step. In Spark, developers simply chain functions to filter, sort, and transform the data, abstracting away many of the low-level details and allowing the underlying infrastructure to optimize behind the scenes. Spark’s abstractions reduce code size in an average Java application by about 30% as compared to Hadoop, resulting in shorter development times and more maintainable codebases.
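The chained style described above can be sketched without a cluster. The example below is only an analogy: it uses the JDK's own java.util.stream API (Java 8 or later), not Spark's RDD API, and the class name, method name, and sample ratings are hypothetical. It shows the same filter, sort, and transform pattern written as one chain of calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: this uses java.util.stream from the JDK, not Spark.
// All names and data here are hypothetical.
class ChainingSketch {

    // Keep ratings of at least minRating, sort them from highest to lowest,
    // and turn each one into a text label.
    static List<String> topRatings(List<Integer> ratings, int minRating) {
        return ratings.stream()
                .filter(r -> r >= minRating)              // filter
                .sorted((a, b) -> Integer.compare(b, a))  // sort, descending
                .map(r -> "rating:" + r)                  // transform
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [rating:5, rating:4, rating:3]
        System.out.println(topRatings(Arrays.asList(3, 5, 1, 4, 2), 3));
    }
}
```

Each step names what should happen to the data rather than how to loop over it, which is the property that lets an engine decide how to schedule the work.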
Spark itself is written in the Scala language, and applications for Spark are often written in Scala as well. Scala’s functional programming style is ideal for invoking Spark’s data processing idioms to build sophisticated processing flows from basic building blocks, hiding the complexity of processing huge data sets.
However, relatively few developers know Scala; Java skills are much more widely available in the industry. Fortunately, Java applications can access Spark seamlessly. But this is still not ideal, as Java is not a functional programming language, and invoking Spark’s programming model without functional programming constructs requires lots of boilerplate code: not as much as Hadoop, but still too much meaningless code that reduces readability and maintainability.
Fortunately, Java 8 supports the functional style more cleanly and directly with a new addition to the language: “lambdas” which concisely capture units of functionality that are then executed by the Spark engine as needed. This closes most of the gap between Java and Scala for developing applications on Spark.
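To make the difference concrete, here is a minimal sketch of the same unit of functionality written both ways. It does not use Spark: the Transform interface and the mapAll helper are hypothetical stand-ins for the single-method function interfaces that a framework asks callers to implement, and the sample data is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LambdaComparison {

    // Hypothetical single-method interface, standing in for the function
    // interfaces a data processing framework would define.
    interface Transform<T, R> {
        R call(T input);
    }

    // Hypothetical helper that applies a Transform to every element.
    static <T, R> List<R> mapAll(List<T> items, Transform<T, R> fn) {
        List<R> out = new ArrayList<R>();
        for (T item : items) {
            out.add(fn.call(item));
        }
        return out;
    }

    // Java 7 style: an anonymous inner class wraps one line of real logic.
    static List<Integer> lengthsJava7(List<String> words) {
        return mapAll(words, new Transform<String, Integer>() {
            @Override
            public Integer call(String word) {
                return word.length();
            }
        });
    }

    // Java 8 style: the lambda states only the logic itself.
    static List<Integer> lengthsJava8(List<String> words) {
        return mapAll(words, word -> word.length());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("IBM", "HP", "DataStax");
        System.out.println(lengthsJava7(names)); // prints [3, 2, 8]
        System.out.println(lengthsJava8(names)); // prints [3, 2, 8]
    }
}
```

The two methods behave identically; the Java 7 version spends five lines on wrapper syntax for what the Java 8 version expresses in one expression.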
To compare Java 7 vs Java 8, I examined a collaborative filtering use case for employment matchups coded in each version of the language. In collaborative filtering, users rate different products, and then the machine learning algorithm judges how they would rate new products, recommending the ones most likely to be to their taste. In this case, the algorithm uses ratings that software professionals give to various companies in order to learn which employers would be the best match for a given employee. Data was accessed from the NoSQL database Cassandra, which was easy to do in Spark using DataStax’s open-source connector.
Considering that data mining and machine learning have always been the most widely touted use cases for Big Data, it’s surprising that machine learning is not included in Hadoop’s core package, so that in practice, most Hadoop deployments calculate only simple summaries like search indexes. Developing algorithms to run on Hadoop is practical only for the most mathematically sophisticated teams. The Apache Mahout project provides mature machine learning functionality, but is not tightly integrated with Hadoop, and in fact is moving away from Hadoop’s MapReduce architecture to run on Spark. Spark, on the other hand, makes machine learning easy with the tightly integrated MLlib library.
I compared Java 7 code with Java 8 code that accomplished the same tasks: training the machine learning system on the Cassandra training dataset, then checking the results by predicting preferences on a validation dataset and evaluating the error in these predictions. In Java 7, there’s a lot of repetitive wrapper code needed to specify the functions that are composed for data processing. Though an Integrated Development Environment like Eclipse can generate the boilerplate automatically, it still makes the code harder to understand and update as requirements change. In Java 8, it’s far more straightforward, with lambdas expressing the functional units concisely and simply. This machine learning task takes 30% fewer lines of code in Java 8 as compared to Java 7, which makes not only for more efficient development, but for more readable and maintainable code going forward. Taken together with the reduced code complexity of Spark as compared to Hadoop, this means that Java 8 with Spark can require roughly half the lines of a similar Hadoop program, since the two 30% reductions compound (0.7 × 0.7 ≈ 0.49).
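The step of evaluating the error in the predictions can be illustrated with a small, framework-free sketch. The article does not say which error metric the project uses; root-mean-square error (RMSE) is assumed here because it is a common choice for rating predictions. The class name, method name, and sample ratings are hypothetical.

```java
// Assumption: RMSE is used as the error metric. The source does not
// specify one, and this sketch does not use Spark or MLlib.
class PredictionError {

    // Root-mean-square error between predicted and actual ratings.
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / predicted.length);
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 2.0, 5.0, 3.0};
        double[] actual    = {5.0, 3.0, 4.0, 2.0};
        // Every prediction is off by exactly one point, so this prints 1.0
        System.out.println(rmse(predicted, actual));
    }
}
```

A lower value on the validation dataset means the trained model's predicted ratings sit closer to the ratings users actually gave.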
If you haven’t switched yet to Java 8, start planning now. It’s mature, with two service releases already out since the major version. Upgrading is easy, as version 8 maintains backwards compatibility with existing codebases and skillsets. Even the new lambdas, which make functional programming so easy, are a convenient syntax for implementing the same single-method interfaces that Java 7 code implements with clunky anonymous-class boilerplate, and that older style is still supported. Behind the scenes the compiler handles the two differently (lambdas are linked at run time through the invokedynamic instruction rather than compiled into a separate class file each), but both styles interoperate freely in the same codebase.
Big Data processing in the form of Hadoop has been around for a decade, and Java has been around for a decade more. A lot has been done with these technologies so far, but we’re only just now getting started. With Spark and Java 8, developers and deployers who don’t have special expertise in Big Data can finally develop machine learning and other Big Data applications.
A code project that illustrates Java 8 (as well as Java 7) with Apache Spark on Cassandra is available at Joshua’s GitHub repository.
About the author: Joshua T. Fox is a consultant specializing in software architecture and technical evangelism. His background includes stints as senior architect and technical lead at IBM, HP, and VC-backed startups. Joshua holds a PhD from Harvard University, and his articles have appeared at leading publications including ReadWrite, JavaWorld, and InfoSecurity. More at joshuafox.com.