Datanami’s Look Back on Key Topics: 2011-2021
2015 – Spark Takes the Big Data World by Storm
When UC Berkeley’s AMPLab released Spark as an open source product back in 2010, nobody could have foreseen the huge impact that it would have on the big data ecosystem – an impact that continues to this day.
The original idea of Spark’s creator, Matei Zaharia, was to build a better and faster version of MapReduce, which at that time was the main execution engine in Hadoop. While it could crunch huge amounts of data stored across many disks linked with HDFS, MapReduce’s linear data flow (read data from disk, map a function, reduce the results, and then write back to disk) made it painfully slow. It was notoriously difficult to program, too.
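That linear data flow can be sketched in plain Python. This is a schematic illustration of the read-map-reduce-write cycle, not Hadoop code; the file names and the `run_stage` helper are invented for the sketch. The point is that every stage's output hits disk before the next stage can read it.

```python
import json
import os
import tempfile

def run_stage(input_path, func, output_path):
    """Read records from disk, apply func, write results back to disk --
    the linear data flow that made chained MapReduce jobs slow."""
    with open(input_path) as f:
        records = json.load(f)
    with open(output_path, "w") as f:
        json.dump(func(records), f)

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.json")
mapped = os.path.join(workdir, "mapped.json")
reduced = os.path.join(workdir, "reduced.json")

# Source data starts on disk.
with open(src, "w") as f:
    json.dump(["to", "be", "or", "not", "to", "be"], f)

# Map stage: emit (word, 1) pairs; results land on disk before reduce starts.
run_stage(src, lambda words: [[w, 1] for w in words], mapped)

# Reduce stage: sum counts per word; input again comes from disk.
def reduce_counts(pairs):
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

run_stage(mapped, reduce_counts, reduced)

with open(reduced) as f:
    counts = json.load(f)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Every arrow in the pipeline is a round trip through storage, which is exactly the overhead Spark's in-memory approach set out to eliminate.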
Early Spark users reported sizable performance boosts compared to MapReduce, ranging from 10x to 100x. The core breakthrough that enabled this speed-up was the resilient distributed dataset (RDD), which could keep working data sets in a distributed shared memory architecture. This enabled programmers to write iterative (i.e. MapReduce-style batch) as well as interactive (i.e. database-style) applications in Java, Scala, and Python.
Spark adoption started out fairly modestly. It was just one obscure open-source project from academia, competing with hundreds of other projects to gain traction in a bustling big data ecosystem. Hadoop was unquestionably the big elephant in the room, and garnered most of the attention.
But by 2013, Spark was starting to turn some heads in the computing community. That was the year that Spark was accepted as an incubator project at the Apache Software Foundation, as well as the year that Zaharia and his advisors and colleagues at AMPLab co-founded Databricks.
Spark amassed considerable momentum in 2014. Yahoo, which effectively ran its whole business on Hadoop, adopted Spark for machine learning and BI workloads on its Hadoop cluster. That same year, Spark and its various add-ons, including Spark Streaming, MLlib, GraphX, and Shark (which would be renamed Spark SQL), became a top-level project at the ASF. However, Spark was still fairly new at the time, and some enterprises were hesitant to adopt it for production workloads before its scalability was proven.
But by the end of 2014 and the start of 2015, the writing was on the wall: Spark would become a computing force to be reckoned with. Cloudera was dedicating resources to it, as were MapR and Google Cloud. Numerous third-party vendors in the Hadoop ecosystem, with names like ClearStory Data, Platfora, Datameer, and Trifacta, adopted Spark as their runtime engine. The ETL vendors followed suit with the all-important data transformation workloads that previously relied on MapReduce. Hortonworks, which initially backed Tez, eventually came around and threw its support behind Spark too.
The Strata + Hadoop World conference that took place in the spring of 2015 perhaps should have been called “Strata + Spark World.” Intel came out in full support of Spark, with a pledge to improve Spark performance on its processors, joining dozens of other industry players that were rewriting the internals of their applications to incorporate Spark.
The open source project continued to grow. According to Zaharia’s keynote at Strata + Spark–er, Strata + Hadoop World–the number of contributors to the Spark project exceeded 500, up from fewer than 200 at the beginning of 2014. The size of the codebase nearly doubled during that time, from 190,000 lines of code to about 370,000, said Zaharia, who was named a Datanami Person to Watch for 2016.
Spark gained an important new feature with the release of Spark 1.3 in 2015: a DataFrame API. Instead of coding directly against RDDs, developers could write their Spark applications to access a Spark DataFrame, which uses table-like structures to cache data in memory. This abstraction layer made Spark programming more accessible, and undoubtedly widened its adoption. Around the same time, the Shark project was renamed Spark SQL, which would soon become the most popular subproject.
Amazon Web Services got the Spark bug in 2015, and added support for it to Elastic MapReduce (EMR), its Hadoop runtime. There is likely some old MapReduce code running in EMR today, but it’s mostly Spark at this point. The same goes for Microsoft Azure, which supports Spark in its HDInsight Hadoop runtime. On Google Cloud, customers can get Spark service through its Dataproc offering.
The release of Spark 1.4 later in 2015 introduced support for R, giving customers another language in which to code Spark applications. IBM, which had developed its own Hadoop distribution called BigInsights, announced that it was dedicating 300 engineers to work on Spark (as well as Hadoop), and introduced Spark on its System z mainframe later in the year.
By September of that year, it was becoming clear that 2015 would be the “Year of Spark.” Doug Cutting, the co-creator of Hadoop and Cloudera’s chief architect, praised the open source project as “an all-around win” for Hadoop. The same month, Cloudera’s chief technologist said that Spark “was the future of Hadoop.”
Meanwhile, the folks at Databricks were hard at work trying to improve Spark. Since it was written in Scala, the Spark framework runs in a Java Virtual Machine (JVM), which isn’t exactly known for high-performance computing. With Project Tungsten, Databricks co-founder and chief architect Reynold Xin (also a Datanami Person to Watch for 2017) hoped to keep Spark zooming along in a faster world full of 40 GbE and SSDs.
As 2015 came to an end, Spark was clearly on its way to doing something special in the big data space, and even fulfilling the promise of becoming the data platform that Hadoop eventually failed to become. Ali Ghodsi, Zaharia’s AMPLab advisor and the co-founder and CEO of Databricks, cited Spark’s flexibility, simplicity, and ability to unify compute paradigms as key to its success.
“It unified all of these different types of analytics under one framework, whereas Hadoop didn’t have, for instance, the machine learning, SQL, or other components, such as the real-time component,” Ghodsi, a Datanami Person to Watch for 2019, said in an interview for that honor. “So bringing what we call ‘unified analytics’ under one umbrella is what made it super powerful.”
Apache Spark isn’t perfect, but it has advanced the ball on data-intensive computing unlike any other piece of software today. The software is still going strong, and likely will for years to come.