June 6, 2016

Spark Makes Inroads into NoSQL Ecosystem

Alex Woodie

Apache Spark may have gained fame for being a better and faster processing engine than MapReduce running in Hadoop clusters. But the in-memory software is increasingly finding use outside of Hadoop, including integration with operational NoSQL databases.

Spark is currently supported in one way or another by all the major NoSQL databases, including Couchbase, DataStax, and MongoDB. DataStax may have been the first to announce support for Spark nearly two years ago with Apache Cassandra, but today all of the “big three” open source NoSQL databases offer Spark connectors. And Spark is supported in some manner by a range of other NoSQL databases, including those from Aerospike, Apache Accumulo, Basho’s Riak, Neo4j, Redis, and MarkLogic.

The primary use case for deploying Spark and NoSQL databases together involves bridging the transactional and analytic divide. Practical examples typically fall into several buckets, including powering product recommendations on an e-commerce site, doing deep analyses of IoT data, driving customer-360 initiatives, and detecting fraud.

Spark and NoSQL make a good combination, as they complement each other’s strengths. Organizations today are often picking NoSQL databases over relational databases to power large-scale Web, mobile and IoT applications that need schema flexibility, support for semi-structured data types like JSON, and horizontal scalability on commodity hardware.

And just as relational data warehouses from Teradata (NYSE: TDC), IBM (NYSE: IBM), and Oracle (NYSE: ORCL) have traditionally been the source of business insights that are put into action with relational databases, we’re now seeing Apache Spark take on that role. Analysts and data scientists use Spark to crunch all sorts of data in search of business insights, via machine learning algorithms, graph analytics, or straight SQL analyses. Those insights are then fed back into the operational system, which increasingly resides atop a NoSQL database.

Demand for Spark capabilities is growing, according to Couchbase’s director of big data product management Will Gardella. “Definitely there are a lot of people who are asking for it,” he says. “Spark is really popular and it’s getting a lot of mindshare. A lot of the stuff people used to do in Hadoop, they like to look at Spark for.”

Couchbase has offered a Spark Connector for its NoSQL database for over a year. Like the connectors from other NoSQL vendors, Couchbase’s Spark connector enables Couchbase data to be materialized as Spark DataFrames and Datasets, which makes that data available to Spark’s SQL, machine learning, and graph APIs.

Today at the Spark Summit, Couchbase announced Spark Connector version 1.2. Gardella says the new connector focuses mainly on speed and performance.

The first way the new connector improves speed is through better data locality. “In Couchbase, we know where every document on the Couchbase cluster is,” Gardella says. “We can use that information to find exactly where to go for certain kinds of queries so they can be ultra-efficient. We get exactly the piece of data we need to exactly the nodes that need to use them.”
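To make the data locality idea concrete, here is a minimal Python sketch of how a client that knows which node owns each document can send every request straight to the right node. It is a toy model and not Couchbase's actual algorithm or API: the partition count, the node names, and the functions `partition_for`, `node_for`, and `plan_fetch` are all assumptions made up for illustration.

```python
import zlib
from collections import defaultdict

# Hypothetical values for illustration; a real cluster has its own topology.
NUM_PARTITIONS = 8
NODES = ["node-a", "node-b", "node-c"]

# Toy cluster map: each partition is owned by exactly one node (round-robin).
CLUSTER_MAP = {p: NODES[p % len(NODES)] for p in range(NUM_PARTITIONS)}


def partition_for(key: str) -> int:
    """Map a document key to a partition using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def node_for(key: str) -> str:
    """Look up the node that owns the partition holding this key."""
    return CLUSTER_MAP[partition_for(key)]


def plan_fetch(keys):
    """Group keys by owning node so each node is asked only for its own documents."""
    plan = defaultdict(list)
    for key in keys:
        plan[node_for(key)].append(key)
    return dict(plan)


keys = [f"user::{i}" for i in range(10)]
plan = plan_fetch(keys)
for node, node_keys in sorted(plan.items()):
    print(node, node_keys)
```

Because the mapping from key to node is deterministic, a query planner in this toy model never has to broadcast a request to every node; presumably a similar principle is what lets the connector schedule work close to the data.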

The second big new feature in the connector is support for predicate pushdown. By essentially pre-filtering Spark queries against the Couchbase database and scanning only against data that matches the query, Couchbase-Spark users can cut down on the amount of data sent over the wire.
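The effect of predicate pushdown can be illustrated with a small, self-contained Python sketch. It does not use Spark or the Couchbase connector; `ToyStore` is a hypothetical in-memory stand-in for a remote database, and its `docs_sent` counter is an assumed proxy for the data sent over the wire.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a remote document database."""

    def __init__(self, docs):
        self._docs = list(docs)
        self.docs_sent = 0  # counts documents "sent over the wire"

    def scan(self, predicate=None):
        """Return documents, filtering at the source when a predicate is given."""
        matched = [d for d in self._docs if predicate is None or predicate(d)]
        self.docs_sent += len(matched)
        return matched


def is_us(doc):
    return doc["country"] == "US"


# 100 sample documents; every fourth one is tagged "US".
docs = [{"id": i, "country": "US" if i % 4 == 0 else "DE"} for i in range(100)]

# Without pushdown: fetch everything, then filter on the client side.
no_push = ToyStore(docs)
result_a = [d for d in no_push.scan() if is_us(d)]

# With pushdown: the filter travels to the store, so only matches come back.
with_push = ToyStore(docs)
result_b = with_push.scan(is_us)

print(no_push.docs_sent, with_push.docs_sent)  # prints "100 25"
```

Both approaches return the same 25 documents, but in this toy model the pushdown version transfers a quarter as many. The actual savings with a real connector depend on how selective the query is.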

While Couchbase hasn’t finished benchmarking the new features and isn’t saying how much of a performance boost to expect, Gardella insists the gains will be substantial. “We’ve always had some abilities to interact with Spark data programmatically from the Couchbase connector,” he tells Datanami. “Now we have additional improvements to performance that are pretty significant.”

Connecting transactional and analytical systems is probably the most commonly cited use case for integrating Spark and NoSQL databases. But not to be overlooked is the fact that Spark does so many things so well that it’s essentially become a Swiss Army knife for developers and administrators.

For example, because Spark has so many connectors to all sorts of databases and file systems, it’s become a de facto standard for data integration. Instead of writing ETL scripts to move data among databases, developers can just fire up a Spark shell and start writing commands, Gardella says.

“I don’t really have to do a whole lot of work,” he says. “And if I want to run that script at scale I can deploy it to a cluster, but if I’m just fiddling around, I can do that on my laptop. That’s really cool. That’s something that people really, really like. There’s a lot of interest in the Spark connector just from that perspective. ‘Hey this is a handy little thing in my toolset that lets me do cool things that are useful and doesn’t take a huge investment.’ It’s really different from Hadoop in that way.”

See Spark Run on NoSQL, DataStax Says

Couchbase Passes MongoDB in Functionality, CEO Claims

Applications: Enterprise Analytics

Technologies: Middleware, Storage

Sectors: Financial Services, Healthcare, Manufacturing, Retail

Vendors: Aerospike, Basho, Couchbase, DataStax, IBM, MarkLogic, MongoDB, Oracle, Teradata

Tags: apache spark, data science, NoSQL databases
