June 30, 2014

See Spark Run on NoSQL, DataStax Says

Alex Woodie

DataStax today announced that Apache Spark is included in the latest release of its NoSQL databases, including open source Cassandra and DataStax Enterprise version 4.5. Those databases now include the in-memory Spark tools, thereby giving customers a new option for running analytic workloads on transactional data.

Apache Spark has garnered loads of attention as a potent analytic toolset for Hadoop. Developers are eager for Spark, which offers a single API for accessing a treasure trove of machine learning, graph, SQL, and real-time streaming functionality. With support for Scala, Python, and other languages, the in-memory architecture is viewed as a replacement for MapReduce and a key part of the modern Hadoop v2 stack.

Of course Hadoop is not a requirement for Spark. When the technology was originally developed at Cal Berkeley’s AMPLab, running it on Hadoop was not the primary goal. But just the same, Spark today is typically grouped in with the big yellow elephant. So when DataStax announced it’s the first non-Hadoop data store provider to get Spark-certified by Databricks, the company behind Spark, it turned a few heads.

Running Spark on DataStax is all about letting customers run some analytic routines on transactional data stored in the NoSQL database, says DataStax vice president of products Robin Schumacher. You’re not going to replace your Hadoop or Teradata installation with Spark running on Cassandra and DataStax Enterprise (DSE), but it will come in handy just the same.

“You need to be able to run analytics on your hot transactional data. You don’t have to be in a data warehouse to have analytic needs,” Schumacher tells Datanami. “We needed a technology that allows us to run analytics across a distributed, share-nothing architecture, and we’re able to do that with Spark.”

Spark can use Cassandra or DSE as a data source or a target, Schumacher says. “Spark makes use of RDDs, or resilient distributed data sets, and Cassandra can serve as one of those,” he says. “So you’re able to take advantage of Spark’s distributed in-memory and disk scale-out processing, but rather than target it toward HDFS or Hadoop installations, you can run that now on top of Cassandra.”
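As a rough illustration of the pattern Schumacher describes, the sketch below uses plain Python rather than the Spark or Cassandra APIs. Rows are read from a partitioned source, each partition is summarized on its own, and the partial results are merged into a target. The `orders_by_node` data, the table layout, and both function names are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical stand-in for a distributed table: rows grouped by the node
# that stores them. Real Cassandra partitioning works differently.
orders_by_node = {
    "node1": [("alice", 30.0), ("bob", 12.5)],
    "node2": [("alice", 7.5), ("carol", 20.0)],
}

def summarize_partition(rows):
    """Compute per-customer totals for one partition, with no shared state."""
    totals = Counter()
    for customer, amount in rows:
        totals[customer] += amount
    return totals

def run_job(source):
    """Process each partition independently, then merge the partial results."""
    merged = Counter()
    for rows in source.values():
        merged.update(summarize_partition(rows))
    return dict(merged)

# The "target" here is just another dict; in the scenario described above it
# would be a table in the same database the data came from.
totals_table = run_job(orders_by_node)
print(totals_table)
```

Because each partition is summarized without reference to the others, the per-partition step could in principle run on the node that holds the data, which is the appeal of pairing a scale-out engine with a shared-nothing store.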

DataStax unveiled its own in-memory option for speeding up processing of transactional data earlier this year. With DSE 4.5, Spark can now operate on the in-memory transactional tables, Schumacher says. “So you can be doing transactional operations in in-memory table against Cassandra, then query that table and run analytics on that table in memory in Spark,” he says. “We’re able to deliver a full in-memory solution for both transactional and analytic workloads.”

Think of it as a “dial-a-performance” knob, he says. If a particular transactional or analytic job doesn’t have a need for speed, it can use good old spinning disk. As the need for speed increases, customers can start using special optimization and configuration parameters developed for solid state disks. And then there’s in-memory processing when the speed limit hits the red zone.

Support for Spark was driven by DataStax’s customer base. Some of its customers started experimenting with Spark on the NoSQL database, and let the company know there was real potential there. DataStax engineers bolstered the integration by developing several add-ons, including a connectivity layer, a data-type mapping layer, and performance optimizations.

All of these DataStax-developed pieces are going into the open source Cassandra project, as well as DSE. Without the performance optimizations, queries can take up to 10 times longer, Schumacher says. The company is keeping some Spark-specific features in the commercial DSE product, including high availability failover and push-button management and deployment capabilities.

DataStax has always had one toe in the Hadoop world, so it shouldn’t come as a surprise that it was the first NoSQL database vendor to embrace Spark. Since the first version of Cassandra, the product has included MapReduce, Hive, Pig, and Mahout machine learning capabilities. Now the addition of Spark gives Cassandra and DSE customers even more options for performing real-time analytics.

DataStax isn’t yet supporting some of the newer Spark capabilities, such as Spark Streaming and GraphX, Schumacher says. Instead, it’s encouraging customers to confine their Spark usage primarily to general Spark functionality and Shark SQL, and, to a lesser extent, the MLlib machine learning libraries. “We do have a formal partnership with Databricks, and we will be working with them to roll out additional Spark functionality in an upcoming release,” he says.

The push and pull among Hadoop, Spark, and NoSQL continues with another new feature in DSE 4.5, namely the capability to link data objects in the NoSQL database with data objects that exist in Hortonworks and Cloudera Hadoop clusters. This will give DSE and Cassandra customers the capability to join historical data sitting in Hadoop with transactional data sitting in the NoSQL database, and push that data into either NoSQL or Hadoop for analysis.
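The kind of join described here can be pictured with a small plain-Python sketch. It does not use any DataStax, Hortonworks, or Cloudera API. The two record sets, their field names, and the `join_on_customer` helper are all hypothetical, standing in for historical data held in Hadoop and transactional data held in the NoSQL database.

```python
# Hypothetical historical data (as if stored in a Hadoop cluster).
historical = [
    {"customer": "alice", "lifetime_orders": 42},
    {"customer": "bob", "lifetime_orders": 3},
]

# Hypothetical hot transactional data (as if stored in the NoSQL database).
transactional = [
    {"customer": "alice", "cart_total": 19.99},
    {"customer": "carol", "cart_total": 5.00},
]

def join_on_customer(left, right):
    """Inner-join two lists of records on the 'customer' key.

    Assumes each customer appears at most once in `right`.
    """
    index = {row["customer"]: row for row in right}
    return [
        {**row, **index[row["customer"]]}
        for row in left
        if row["customer"] in index
    ]

joined = join_on_customer(historical, transactional)
print(joined)
```

Only customers present on both sides survive the join, so the result could be written to whichever system is better suited to the follow-up analysis.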
