Spark Makes Inroads into NoSQL Ecosystem
Apache Spark may have gained fame for being a better and faster processing engine than MapReduce running in Hadoop clusters. But the in-memory software is increasingly finding use outside of Hadoop, including integration with operational NoSQL databases.
Spark is currently supported in one way or another by all the major NoSQL databases, including Couchbase, DataStax, and MongoDB. DataStax may have been the first to announce support for Spark nearly two years ago with Apache Cassandra, but today all of the “big three” open source NoSQL databases offer Spark connectors. And Spark is supported in some manner by a range of other NoSQL databases, including those from Aerospike, Apache Accumulo, Basho’s Riak, Neo4j, Redis, and MarkLogic.
The primary use case for deploying Spark and NoSQL databases together involves bridging the transactional and analytic divide. Practical examples typically fall into several buckets, including powering product recommendations in an ecommerce site, doing deep analyses of IoT data, driving customer-360 initiatives, and detecting fraud.
Spark and NoSQL make a good combination, as they complement each other’s strengths. Organizations today are often picking NoSQL databases over relational databases to power large-scale Web, mobile and IoT applications that need schema flexibility, support for semi-structured data types like JSON, and horizontal scalability on commodity hardware.
And just as relational data warehouses from Teradata (NYSE: TDC), IBM (NYSE: IBM), and Oracle (NYSE: ORCL) have traditionally been the source of business insights that are put into action with relational databases, we’re now seeing Apache Spark take on that role. Analysts and data scientists use Spark to crunch all sorts of data in search of business insights, via machine learning algorithms, graph analytics, or straight SQL analyses. Those insights are then fed back into the operational system, which increasingly resides atop a NoSQL database.
Demand for Spark capabilities is growing, according to Couchbase’s director of big data product management Will Gardella. “Definitely there are a lot of people who are asking for it,” he says. “Spark is really popular and it’s getting a lot of mindshare. A lot of the stuff people used to do in Hadoop, they like to look at Spark for.”
Couchbase has offered a Spark Connector for its NoSQL database for over a year. Like other NoSQL vendors’ connectors, Couchbase’s Spark connector enables Couchbase data to be materialized as Spark DataFrames and Datasets, making that data available to Spark’s SQL, machine learning, and graph APIs.
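In practice, that materialization looks much like any other Spark data source. The following is a minimal Scala sketch, assuming the 1.x connector’s `read.couchbase` entry point and a local cluster with a `travel-sample` bucket; the host, bucket, and field names are illustrative, and the exact API may differ by connector version:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.EqualTo
import com.couchbase.spark.sql._  // Couchbase Spark Connector

// Point Spark at a Couchbase cluster and bucket (hypothetical host/bucket)
val conf = new SparkConf()
  .setAppName("CouchbaseDataFrameExample")
  .set("com.couchbase.nodes", "127.0.0.1")
  .set("com.couchbase.bucket.travel-sample", "")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Materialize JSON documents of one type as a Spark DataFrame; the
// connector infers a schema from a sample of the matching documents.
val airlines = sqlContext.read.couchbase(schemaFilter = EqualTo("type", "airline"))

// From here the data is available to Spark SQL, MLlib, and GraphX
airlines.select("name", "country").show()
```

Once the DataFrame exists, nothing downstream needs to know the rows came from a NoSQL store rather than a file or a relational table.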
Today at the Spark Summit, Couchbase announced Spark Connector version 1.2. Gardella says the main focus of the new connector is performance.
The first way the new connector improves speed is through better data locality. “In Couchbase, we know where every document on the Couchbase cluster is,” Gardella says. “We can use that information to find exactly where to go for certain kinds of queries so they can be ultra-efficient. We get exactly the piece of data we need to exactly the nodes that need to use them.”
The second big new feature in the connector is support for predicate pushdown. By pushing query filters down into the Couchbase database, so that only data matching the predicate is scanned and returned, Couchbase-Spark users can cut down on the amount of data sent over the wire.
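Conceptually, pushdown means a filter written against a Spark DataFrame is translated into a condition the database evaluates itself, rather than being applied after all the data has already reached Spark. A hedged sketch of the idea, assuming the connector’s DataFrame source and hypothetical document fields:

```scala
import org.apache.spark.sql.sources.EqualTo
import com.couchbase.spark.sql._  // Couchbase Spark Connector

// Hypothetical: a DataFrame backed by Couchbase documents of type "order"
val orders = sqlContext.read.couchbase(schemaFilter = EqualTo("type", "order"))

// With predicate pushdown, this filter becomes part of the query that
// Couchbase itself evaluates, so only matching documents are scanned
// and shipped across the network to Spark.
val bigOrders = orders.filter(orders("total") > 1000)

// The physical plan's PushedFilters section shows which predicates
// the data source accepted for pushdown.
bigOrders.explain(true)
```

The same filter without pushdown would still return correct results; the difference is purely in how much data crosses the wire before Spark applies it.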
While Couchbase hasn’t finished benchmarking the new features and isn’t saying how much of a performance boost to expect, Gardella insists the gains will be substantial. “We’ve always had some abilities to interact with Spark data programmatically from the Couchbase connector,” he tells Datanami. “Now we have additional improvements to performance that are pretty significant.”
Connecting transactional and analytical systems is probably the most commonly cited use case for integrating Spark and NoSQL databases. But not to be overlooked is the fact that Spark does so many things so well that it’s essentially become a Swiss Army knife for developers and administrators.
For example, because Spark has so many connectors to all sorts of databases and file systems, it’s become a de facto standard for data integration. Instead of writing ETL scripts to move data among databases, developers can just fire up a Spark shell and start writing commands, Gardella says.
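That shell-driven ETL pattern is a few lines of interactive code: read from one source, transform, write to another. A sketch of what such a `spark-shell` session might look like, using the Spark 1.x `SQLContext` API with hypothetical paths (any two supported data sources work the same way):

```scala
// In a spark-shell session: read from one system, write to another.
// Paths and formats are illustrative; swap in any supported source/sink.
val events = sqlContext.read.json("hdfs:///logs/events.json")

// A small transformation: keep successful events, count them per day
val daily = events
  .filter(events("status") === "ok")
  .groupBy("date")
  .count()

// Write the result out as Parquet -- no standalone ETL tool required
daily.write.parquet("hdfs:///analytics/daily_counts")
```

The same script runs unchanged against local files on a laptop or against a cluster, which is exactly the low-investment quality Gardella describes below.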
“I don’t really have to do a whole lot of work,” he says. “And if I want to run that script at scale I can deploy it to a cluster, but if I’m just fiddling around, I can do that on my laptop. That’s really cool. That’s something that people really, really like. There’s a lot of interest in the Spark connector just from that perspective. ‘Hey this is a handy little thing in my toolset that lets me do cool things that are useful and doesn’t take a huge investment.’ It’s really different from Hadoop in that way.”