Apache Spark Is Great, But It’s Not Perfect
Apache Spark is one of the most widely used tools in the big data space, and will continue to be a critical piece of the technology puzzle for data scientists and data engineers for the foreseeable future. With that said, the open source technology isn’t perfect, and prospective users should be aware of its limitations when taking on new projects.
It’s hard to believe, but Apache Spark is turning 10 years old this year, as we wrote about last month. The technology has been a huge success, and has become a critical component of many big data projects at companies and other organizations around the world. In fact, Spark has been such a massive hit that Databricks, the private, Spark-as-a-service company founded by the creators of Apache Spark, was reportedly recently valued at $2.75 billion, which is just below Hadoop leader Cloudera’s stock market capitalization on the New York Stock Exchange.
Having covered the momentum behind Spark last month, we thought it worthwhile, in the interest of covering the topic thoroughly and fairly, to look at some of Spark’s limitations.
Here are three things about Spark that users should know:
1. Complex Performance Parameters
Originally created as an in-memory replacement for MapReduce, Apache Spark delivered huge performance increases for customers using Apache Hadoop to process large amounts of data. While MapReduce may never be fully eradicated from Hadoop, Spark has become the preferred engine for real-time and batch processing.
Because of its preference for storing datasets in memory, it’s no surprise to learn that Spark likes RAM. A lot.
“Spark is definitely a memory hog,” says Bala Venkatrao, vice president of product for Unravel Data, a Menlo Park, California-based startup that helps companies maximize the production of their Spark and Hadoop clusters.
If you starve Spark of RAM, fail to grasp how it works, or make some other configuration error, all those Spark performance benefits you hoped to get will go flying out the window.
“It’s relatively easy to write something in Spark, but what happens under the hood in Spark gets complicated,” Venkatrao tells Datanami. “People just write a piece of code, submit it, and assume magic happens. Well, no, it cannot. You need to go and understand and tune all of that and figure out what’s going on.”
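To give a sense of the kind of tuning Venkatrao is describing, here is a rough sketch of how an executor’s heap is carved up under Spark’s unified memory model. It assumes the documented defaults in Spark 2.x and later (`spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, and roughly 300 MB reserved by Spark itself). The function name and the 4 GB executor heap are hypothetical and chosen only for illustration, and real clusters also have to account for off-heap memory and container overhead that this back-of-the-envelope math ignores.

```python
# Back-of-the-envelope estimate of Spark's unified memory regions.
# Assumes Spark 2.x+ defaults; the function name and figures are illustrative only.

RESERVED_MB = 300  # heap that Spark sets aside for its own internal objects


def unified_memory_budget(executor_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate how an executor heap (spark.executor.memory) is divided.

    memory_fraction  mirrors spark.memory.fraction
    storage_fraction mirrors spark.memory.storageFraction
    """
    if executor_heap_mb <= RESERVED_MB:
        raise ValueError("executor heap must exceed the reserved memory")

    usable = executor_heap_mb - RESERVED_MB
    unified = usable * memory_fraction      # shared by execution and storage
    storage = unified * storage_fraction    # cached data protected from eviction
    return {
        "unified_mb": unified,
        "storage_mb": storage,
        "execution_mb": unified - storage,  # shuffles, joins, sorts, aggregations
        "user_mb": usable - unified,        # user data structures and metadata
    }


if __name__ == "__main__":
    # Hypothetical executor started with a 4 GB heap.
    for region, size_mb in unified_memory_budget(4096).items():
        print(f"{region}: {size_mb:.1f} MB")
```

The boundary between storage and execution is soft in practice, since each side can borrow from the other when there is free space, so these numbers are starting points for reasoning about a job rather than hard limits.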
While Spark is orders of magnitude faster than the MapReduce framework it replaced, and easier to write applications for, there’s no free lunch with Spark, Venkatrao says. “You need to understand how Spark works,” he says. “But for the vast majority of users, it’s not easy to pick up all of the skills and learn that. That’s where a company like Unravel comes in.”
Complexity is a recurring theme in big data analytics, and unfortunately, Spark did not solve the complexity problem for the benefit of the big data community. (The search for bigfoot, alas, continues as well.)
“Spark’s a lot faster and a lot easier to work with than Hadoop,” says Bobby Johnson, who worked with Hadoop at Facebook and went on to co-found Interana, a Silicon Valley firm that uses a time-series database to surface behavioral insights in real time. “But it’s still pretty damn complicated.”
Spark, like Hadoop, was created on the architectural concept of a data lake where you can store all of your data and bring different compute engines to bear. That paradigm has delivered a lot of flexibility to customers, but it may not be the best data processing blueprint moving forward, Johnson says.
“Kafka is coming on the scene as a real player,” Johnson says. “I like the paradigm of a data pipeline. I feel it’s a much better framework for building strategically around than a data lake.”
2. Jack of All Trades
Spark’s reputation as the Swiss Army Knife of big data is well founded. It offers libraries for ETL (Spark Core), SQL (Spark SQL), machine learning (MLlib), real-time processing (Spark Streaming), and graph processing (GraphX); APIs for Python, R, Java, and Scala developers; and over 100 data source connectors developed by a thriving open source community.
But that generality can also be a double-edged sword, particularly for organizations seeking the highest development and run-time efficiency for specific analytic applications in production.
“Spark is a great fit for in-memory processing workloads and we see Spark usage growing for batch ETL, streaming ETL, and machine learning workloads in our user base,” says Ashish Thusoo, the CEO of big data cloud company Qubole. “Given the multiple language support such as Python, R and Scala; multiple use case support; and growing community of Apache Spark, we think Spark is here to stay and thrive for the workloads it is a great fit for.
“But for ad hoc analytics and deep learning use cases, we are seeing users gravitating towards other open source engines and frameworks such as Presto and TensorFlow,” says Thusoo, who co-created Apache Hive while leading the Data Infrastructure team at Facebook.
In particular, use of Presto (which was also developed at Facebook to replace Hive) grew by 400% last year on the Qubole cloud, Thusoo says. Spark increased by nearly 300% during that time, which is not too shabby, but a bit behind Presto.
Spark’s popularity on Google Trends peaked in the June 2015 timeframe, and contributions from the open source community are still strong. But Thusoo has seen enthusiasm for Spark begin to wane in recent years.
“One alarming trend I see is that while the open source community is growing, the contributions to open source Apache Spark are slowing down,” Thusoo tells Datanami. “We have seen similar trends in other open source projects as they become more legacy projects where the original contributors abandon pure open source contribution in favor of more proprietary capabilities. This is antithetical to the health of any open source project.”
3. Occasional Bugs
As Apache Spark has matured, it has been widely adopted across industries. Early concerns about scalability have been resolved, and the product is considered to be production grade. For proof of that, just look to IBM, which was an early believer in Spark and has integrated Spark into many of its own data products.
But that doesn’t mean Spark is completely without problems. One of the organizations that’s run into trouble with Spark is Walmart, the world’s largest company. The technology has been adopted across the organization, and last month we learned how its R&D arm, Walmart Labs, used Spark’s machine learning capabilities to develop a new demand forecasting system for its roughly 4,700 stores in the United States.
When trained with historical sales data, the demand forecasting model that Walmart Labs’ data scientists built with Apache Spark managed to deliver a demand forecast for 500 million item-by-store combinations that was superior to its existing demand forecasting system, as we reported last month. The only problem is that the Spark program didn’t run reliably.
Walmart Labs’ Spark application would run as expected one time, then generate “garbage” the next time around, Walmart Labs Distinguished Data Scientist and Director of Data Science John Bowman told an audience at Nvidia’s GPU Technology Conference.
“So when this sort of thing happens, naturally we suspect some sort of memory leak somewhere,” he said. “But we couldn’t find it and we couldn’t figure out a way of working around it.”
Bowman said his team spent over six weeks attempting to debug and restructure the Spark code, but could not get it to work, and so the group ported the Spark code to Nvidia’s RAPIDS environment and ran it on a GPU cluster instead. That system generated the same quality of forecast as the Spark system, but without the random cluster errors, and will be implemented across the company’s entire U.S. operations by the end of the year, Bowman said.
While Spark is ready for primetime in demanding production environments, that doesn’t mean there are no bugs in Spark, says Ali Ghodsi, who was an advisor to Spark creator Matei Zaharia at the UC Berkeley AMPLab, and is also the CEO of Databricks.
“There are millions of lines of code [in Spark] at this point,” Ghodsi says. “People are using it at massive scale. At Databricks, we are using it for many petabyte-sized data lakes. But Spark is not without bugs, and that’s why we need the community. The community keeps patching bugs that they find. And sometimes new bugs creep in. So it’s still a very fast development project with a big, active community.”
Apache Spark has been a huge success and will continue to be relied upon to process big data sets in the years to come. It’s clearly a case of the right technology being introduced at the right time to have a massive impact. However, like all technology created by humans, Spark is not perfect. Developers who understand Spark’s limitations will have a better chance of finding success in their projects.