August 24, 2015

How Spark Democratizes Analytic Value from Hadoop Lakes

Alex Woodie

(Risto Viita/Shutterstock)

So you’ve installed Hadoop and built a data lake to house all the bits and bytes that your organization previously discarded. So now what? If you follow the advice from industry experts, the next step on your analytics journey is to add Apache Spark to the mix.

It’s common for people to confuse Hadoop with analytics, says Rob Thomas, vice president of product development at IBM Analytics. “Hadoop itself doesn’t do analytics,” Thomas tells Datanami. “Hadoop is the data storage platform. Spark is the analytics platform. It’s really misunderstood, I think.”

Thomas is among the growing chorus of analytic experts and business intelligence leaders who are singing the praises of Apache Spark. Barely a year into its stint as a top-level Apache project, Spark is already well on its way to solidifying its grip as the go-to tool powering analytics atop Hadoop and beyond. While the big data world continues to churn out new project after new project, Spark is maintaining a very high level of interest as a key component of the emerging big data stack, if not the linchpin holding it all together.

What makes Spark so powerful? There are several factors, including its speed relative to MapReduce and its unified programming model built on Scala. But what really makes Spark sing is its capability to traverse different data repositories, including Hadoop data lakes.

“The biggest limitation on analytics in enterprises today is the fragmentation of data,” Thomas continues. “It’s the world that all the IT vendors have created, where a client has hundreds of different repositories of data. Different people are allowed to access different repositories. Nobody has a holistic view, so that limits the impact of analytics in an organization.”

Companies that can democratize access to that data will have an edge over their peers, Thomas says. “But to do that, you need some kind of a processing layer that’s independent of a repository but can provide you access to data,” he says. “In my mind, that is Spark.”

Spark As ‘Unifying Force’

IBM made a big splash in the Spark pool earlier this summer when it announced a major initiative to invest in Spark and embed the in-memory framework into a variety of its products and services. It also partnered with Databricks, the company behind Spark, donated its SystemML machine learning framework to the Spark project, and committed to helping train more than 1 million data scientists.

Another analytics firm using Spark to make multiple data sets appear as one is Zoomdata, an up-and-coming provider of BI tools that uses its patented “micro query” and “data sharpening” techniques to visualize huge sets of data in real time.

“It’s important that we don’t move the data. We try to process the data in place as much as possible,” Zoomdata’s product manager Scott Cappiello told Datanami recently. “To the extent that we need to do any joining between the data, we actually leverage Spark to do that.”

Another big data startup building on Spark is Cognitive Scale, a Texas software company that combines graph analytics, machine learning, and cognitive computing to deliver industry-specific analytic solutions that adapt over time.

“Big data really is the fuel for cognitive and analytic systems. But that fuel today is really unrefined and raw,” says Cognitive Scale co-founder and CTO Matt Sanchez. “A lot of companies have spent time collecting that information and storing it. But that’s been building the pipes or the plumbing. That’s been the focus of the big data and the Hadoop ecosystem.”

Once companies have installed Hadoop and filled it with big data, Sanchez (former head of Watson Labs) and his colleagues can go to work with Spark-powered apps. “We put the cognitive cloud right down next to the data lake and we can start to pull information from that data lake and be able to compute it in a way that allows us [to] generate insight and actionable learning, and package that up as insights for real human beings, not just data scientists,” he says.

Hadoop and Spark: Living Together

Spark doesn’t need Hadoop, a fact that has spurred speculation that Spark will eventually leave its elephantine cousin in the dust. While the future is notoriously hard to predict, that eventuality looks unlikely because of how well the two products work together.

According to Syncsort president Josh Rogers, Spark will emerge as the winning engine for doing machine learning in Hadoop data lakes, and it may even give other SQL engines a run for the money. “If I’ve already got my data in HDFS, my ability to apply Spark to it is super useful, so I can probably think of Spark as one of the key projects within Hadoop,” he says.

While you can run Spark in a Hadoop-less cloud, on a beefy workstation, or even in the Cassandra NoSQL database, standalone Spark clusters are few and far between, thanks to easy access through the Hadoop distributors, Rogers says.

“I believe that what effectively has already happened is that Spark has been subsumed into the Hadoop family of projects,” he says. “The Hadoop distributors have really embraced Spark – Cloudera early on and Horton a bit later. Most people are buying and getting support for their Spark implementations through one of the Hadoop distributors.”

The combination of Spark and Hadoop is like chocolate and peanut butter, IBM’s Thomas says. “Whenever I talk to clients, I actually tell them, you need Hadoop [and] you need Spark. I believe they need both,” he says. “They serve a fundamentally different purpose… If you’re trying to store data at a really low cost, Hadoop is great for that. If you actually want to do analytics, you need Spark. They’re complementary in that respect.”

Spark provides the analytic engine that Hadoop data lakes really need, Thomas continues. “Don’t get me wrong, we love Hadoop. But it hasn’t lived up to the analytic promises,” he says. “People are really looking for real-time insights and they see Spark as the answer for that… As people understand it they realize that this is a game changer and this enables them to do analytics at a level they could never do before.”

Hortonworks’ vice president of corporate strategy Shaun Connolly agrees that Spark is generating interest, but disagrees with the notion that Hadoop has failed to provide useful analytics for data lakes.

“The reality is, if you look at HDP, we’ve integrated in a whole range of data processing engines, and YARN-enabled things like SAS’s LASR Analytic Server, and even things like Pivotal HAWQ and HP Vertica, to run natively in a Hadoop system,” Connolly tells Datanami. “I would venture to guess they are analytic providers!”

Connolly says Hortonworks’ vision for Hadoop has always centered on having a mix of different analytic engines powering different big data workloads. “Spark clearly is one of them,” he says, adding that it’s used by about 30 percent of Hortonworks customers.

“There’s definitely interest,” he continues, “and as it becomes hardened, I think we expect it to be used for more use cases, which is great for me because at the end of the day, you need to do interesting things on your lake of data, and that will be an engine for a variety of applications.”

Applications: Data Mining, Enterprise Analytics, Predictive Analytics

Technologies: Frameworks

Sectors: Financial Services, Healthcare, Retail

Vendors: Cognitive Scale, Databricks, HP, IBM, Pivotal, Zoomdata

Tags: apache spark, data lakes, Hadoop, Spark
