Spark Steals the Show at Strata
There was a lot of good stuff on display at last week’s Strata + Hadoop World conference. But if there was one product or technology that stood out from the pack, that would have to be Apache Spark, the versatile in-memory framework that is taking the big data world by storm. At Strata, Spark creator Matei Zaharia showed how the technology will get even more powerful in the months to come.
Spark has garnered an incredible amount of momentum, largely running within Hadoop as a replacement for first-gen MapReduce programs. In just the past 15 months, the in-memory framework has emerged as the best candidate for fulfilling Hadoop’s potential for analyzing vast amounts of data.
During his keynote at Strata + Hadoop World, Zaharia discussed where Spark has been over the past year, and where it will be going in the future. “By all metrics 2014 was really an amazing year for Spark,” says Zaharia, who developed Spark at the AMPLab and is currently the CTO at Spark startup Databricks.
Zaharia pointed out that Spark is now the most popular Apache project, ending the year with more than 500 contributors, up from fewer than 200 at the beginning of 2014. The size of the codebase nearly doubled, from 190,000 lines of code to about 370,000. These are all indications that Spark is on fire. And looking forward, there’s no reason to think Spark’s popularity will wane, with new features aimed at version 1.3, which is due next week, and version 1.4, which is slated to ship in June.
“As we’ve seen the increased use of Spark and the new types of applications people want to do with it, we’re focusing on two areas,” Zaharia says. “The first one is data science. We see increasingly that the people who want to use Spark are not just software developers, but they’re data scientists, maybe experts in other fields who need to run computations on large data.”
The second direction the Spark backers are going is developing new platform interfaces. “We see Spark running in a very wide range of environments, not just Hadoop and the public cloud but also NoSQL storage and traditional data warehouse environments and so on, and we want to make it efficient” to run Spark there.
Spark version 1.3 will contain support for DataFrames, which will help data scientists work with Spark. “If you’re not familiar with DataFrames, they’re a very common API for working with data on a single machine,” Zaharia says. “They’re used in R…as well as Pandas, one of the most popular packages for Python. And they give you this very concise way to write expressions to do operations on data.”
Spark version 1.3 uses almost the same syntax for DataFrames as Pandas, says Zaharia. What’s more, the DataFrame implementation will be able to take advantage of the query optimizer in Spark SQL. “Instead of just running each step of the computation one by one, we look at the whole program and come up with a more efficient plan,” he says. “This leads to not just easier-to-write code, but much better performance.”
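To illustrate the kind of concise DataFrame expression Zaharia describes, here is a small Pandas example (the column names and data are hypothetical). Spark 1.3’s DataFrame API is designed to look nearly identical, with the difference that Spark SQL’s optimizer plans the whole chain of operations before executing any of it:

```python
import pandas as pd

# Hypothetical request-log data: one row per request
logs = pd.DataFrame({
    "user": ["alice", "bob", "alice", "carol"],
    "latency_ms": [120, 340, 95, 210],
})

# Concise, declarative-style operations: filter, then group, then aggregate.
# The rough Spark 1.3 equivalent would be something like:
#   df.filter(df.latency_ms > 100).groupBy("user").avg("latency_ms")
slow = logs[logs["latency_ms"] > 100]
avg_latency = slow.groupby("user")["latency_ms"].mean()

print(avg_latency.to_dict())
# → {'alice': 120.0, 'bob': 340.0, 'carol': 210.0}
```

The appeal for data scientists is that each step reads as a statement about the data rather than a loop over it, which is exactly the style R and Pandas users already know.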
“I’m excited about this,” Zaharia continues. “It’s a game changer in terms of which types of users can access big data, how easy it is to use it efficiently, and this is going to come out in the next release of Spark.”
The following Spark release, version 1.4, will contain the full interface to R. “In R you’ll be able to use data frames, RDDs [resilient distributed datasets], and machine learning libraries from the existing Spark, and just call it from R,” Zaharia says. “With this, Spark is going to talk to Scala, Python, Java, and R, the four most popular languages for big data today.”
Spark 1.4 will also bring a standard API for accessing external data sources in Spark, which could widen the potential uses of the in-memory framework even further. “We’ve added a standard API for smart data sources that actually many other technologies are now plugging into,” Zaharia says. “This can give you back data frames that you can use in your programs or it can let you query the data in SQL.”
It’s rare that all of the data a user wants to query will reside in just one location, so being flexible in how data is pulled and processed from various sources is one key to success in big data analytics. This concept of the “virtualized data warehouse” is catching on, and it appears that Spark 1.4 will bring some capabilities in this regard.
“Instead of just naively scanning the data, [the API for external data sources] lets you also push queries and logic into the source, which is very important for big data to minimize the amount of work done,” Zaharia says. For example, the API could be used to join and combine user data in MySQL and log data in Hive in a single query, he says.
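The pushdown idea Zaharia describes can be sketched in plain Python (all names here are hypothetical, not the actual Spark data sources API): a “smart” source accepts a predicate and filters rows at the source, so the engine never has to move data it would only discard:

```python
# Hypothetical sketch of predicate pushdown -- not the real Spark API.
# A naive source ships every row; a smart source accepts a predicate
# and applies it before shipping, minimizing the work and data moved.

ROWS = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
    {"user_id": 3, "country": "US"},
]

def naive_scan():
    # Ships every row to the engine; filtering happens afterwards.
    return list(ROWS)

def smart_scan(predicate):
    # Pushes the filter into the source: only matching rows are shipped.
    return [row for row in ROWS if predicate(row)]

wanted = lambda row: row["country"] == "US"

shipped_naive = [r for r in naive_scan() if wanted(r)]  # 3 rows moved, 2 kept
shipped_smart = smart_scan(wanted)                      # only 2 rows moved

assert shipped_naive == shipped_smart
print(len(shipped_smart))  # → 2
```

Both paths produce the same answer; the difference is where the filtering happens, which matters enormously when the source holds terabytes rather than three rows.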
The Apache Spark project is focused on helping users solve real-world big data problems, Zaharia says. “Our goal is very simple: We want to give you a single unified data engine that you can use for all your sources, workloads, and environments,” he says. “So you just have to learn and manage this one tool to combine the many disparate types of data available. Our experience so far shows that it is both possible and very useful.”