How Spark Democratizes Analytic Value from Hadoop Lakes
So you’ve installed Hadoop and built a data lake to house all the bits and bytes that your organization previously discarded. So now what? If you follow the advice from industry experts, the next step on your analytics journey is to add Apache Spark to the mix.
It’s common for people to confuse Hadoop with analytics, says Rob Thomas, vice president of product development at IBM Analytics. “Hadoop itself doesn’t do analytics,” Thomas tells Datanami. “Hadoop is the data storage platform. Spark is the analytics platform. It’s really misunderstood, I think.”
Thomas is among the growing chorus of analytic experts and business intelligence leaders who are singing the praises of Apache Spark. Barely a year into its stint as a top-level Apache project, Spark is already well on its way to solidifying its grip as the go-to tool powering analytics atop Hadoop and beyond. While the big data world continues to churn out new project after new project, Spark is maintaining a very high level of interest as a key component of the emerging big data stack, if not the linchpin holding it all together.
What makes Spark so powerful? There are several factors, including its speed relative to MapReduce and its unified programming model built on Scala. But what really makes Spark sing is its capability to traverse different data repositories, including Hadoop data lakes.
“The biggest limitation on analytics in enterprises today is the fragmentation of data,” Thomas continues. “It’s the world that all the IT vendors have created, where a client has hundreds of different repositories of data. Different people are allowed to access different repositories. Nobody has a holistic view, so that limits the impact of analytics in an organization.”
Companies that can democratize access to that data will have an edge over their peers, Thomas says. “But to do that, you need some kind of a processing layer that’s independent of a repository but can provide you access to data,” he says. “In my mind, that is Spark.”
Spark As ‘Unifying Force’
IBM made a big splash in the Spark pool earlier this summer when it announced a major initiative to invest in Spark and embed the in-memory framework into a variety of its products and services. It also partnered with Databricks, the company behind Spark, donated its SystemML machine learning framework to the Spark project, and committed to helping train more than 1 million data scientists.
Another analytics firm using Spark to make multiple data sets appear as one is Zoomdata, an up-and-coming provider of BI tools that uses its patented “micro query” and “data sharpening” techniques to visualize huge sets of data in real time.
“It’s important that we don’t move the data. We try to process the data in place as much as possible,” Zoomdata’s product manager Scott Cappiello told Datanami recently. “To the extent that we need to do any joining between the data, we actually leverage Spark to do that.”
Another big data startup building on Spark is Cognitive Scale, a Texas software company that combines graph analytics, machine learning, and cognitive computing to deliver industry-specific analytic solutions that adapt over time.
“Big data really is the fuel for cognitive and analytic systems. But that fuel today is really unrefined and raw,” says Cognitive Scale co-founder and CTO Matt Sanchez. “A lot of companies have spent time collecting that information and storing it. But that’s been building the pipes or the plumbing. That’s been the focus of the big data and the Hadoop ecosystem.”
Once companies have installed Hadoop and filled it with big data, then Sanchez (former head of Watson Labs) and his colleagues can go to work with Spark-powered apps. “We put the cognitive cloud right down next to the data lake and we can start to pull information from that data lake and be able to compute it in a way that allows us [to] generate insight and actionable learning, and package that up as insights for real human beings, not just data scientists,” he says.
Hadoop and Spark: Living Together
Spark doesn’t need Hadoop, a fact that has spurred speculation that Spark will eventually leave its elephantine cousin in the dust. While the future is notoriously hard to predict, that eventuality looks unlikely because of how well the two products work together.
According to Syncsort president Josh Rogers, Spark will emerge as the winning engine for doing machine learning in Hadoop data lakes, and it may even give other SQL engines a run for their money. “If I’ve already got my data in HDFS, my ability to apply Spark to it is super useful, so I can probably think of Spark as one of the key projects within Hadoop,” he says.
While you can run Spark in a Hadoop-less cloud, on a beefy workstation, or even in the Cassandra NoSQL database, standalone Spark clusters are few and far between, thanks to easy access through the Hadoop distributors, Rogers says.
“I believe that what effectively has already happened is that Spark has been subsumed into the Hadoop family of projects,” he says. “The Hadoop distributors have really embraced Spark – Cloudera early on and Horton a bit later. Most people are buying and getting support for their Spark implementations through one of the Hadoop distributors.”
The combination of Spark and Hadoop is like chocolate and peanut butter, IBM’s Thomas says. “Whenever I talk to clients, I actually tell them, you need Hadoop [and] you need Spark. I believe they need both,” he says. “They serve a fundamentally different purpose…If you’re trying to store data at a really low cost, Hadoop is great for that. If you actually want to do analytics, you need Spark. They’re complementary in that respect.”
Spark provides the analytic engine that Hadoop data lakes really need, Thomas continues. “Don’t get me wrong — we love Hadoop. But it hasn’t lived up to the analytic promises,” he says. “People are really looking for real-time insights and they see Spark as the answer for that…As people understand it they realize that this is a gamechanger and this enables them to do analytics at a level they could never do before.”
Hortonworks’ vice president of corporate strategy Shaun Connolly agrees that Spark is generating interest, but disagrees with the notion that Hadoop has failed to provide useful analytics for data lakes.
“The reality is, if you look at HDP, we’ve integrated in a whole range of data processing engines, and YARN-enable things like SAS’s LASR Analytic Server, and even things [like] Pivotal HAWQ and HP Vertica, to run natively in a Hadoop system,” Connolly tells Datanami. “I would venture to guess they are analytic providers!”
Connolly says Hortonworks’ vision for Hadoop has always centered on having a mix of different analytic engines powering different big data workloads. “Spark clearly is one of them,” he says, adding that it’s used by about 30 percent of Hortonworks customers.
“There’s definitely interest,” he continues, “and as it becomes hardened, I think we expect it to be used for more use cases, which is great for me because at the end of the day, you need to do interesting things on your lake of data, and that will be an engine for a variety of applications.”