January 10, 2017

Spark Gets In-Memory Boost


Apache Spark is gaining open source computing and storage capabilities through integration with a widely used in-memory data platform.

Hazelcast Inc. said Tuesday (Jan. 10) its in-memory data grid adds connector support for Spark, giving developers access to open source tools for data storage and computing that the company says go beyond the limits of a single Java virtual machine.

The Hazelcast connector leverages a Spark API called Resilient Distributed Dataset (RDD), a distributed collection of data elements partitioned across cluster nodes to, among other things, provide parallel access to data. The combination of RDD with the Hazelcast in-memory grid is designed to serve as the basis for the improved distributed computing that large datasets require.
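The idea behind a partitioned collection can be illustrated with a minimal sketch in plain Python. This is not Spark's RDD API or the Hazelcast connector; the helper names `partition`, `parallel_map` and `collect` are hypothetical, and local threads stand in for cluster nodes. It only shows how splitting a dataset into partitions lets each chunk be processed in parallel before the results are gathered.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split items into num_partitions roughly equal, contiguous chunks."""
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_map(chunks, fn):
    """Apply fn to every element, one worker per partition, keeping order."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda chunk: [fn(x) for x in chunk], chunks))


def collect(chunks):
    """Flatten the partitions back into a single list."""
    return [x for chunk in chunks for x in chunk]


data = list(range(10))
parts = partition(data, 3)  # three partitions standing in for three nodes
squared = collect(parallel_map(parts, lambda x: x * x))
```

In a real cluster each partition would live on a different machine, so the per-partition work would run where the data is stored rather than in local threads.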

“Any big data solution needs to be able to distribute processing and storage across machines whilst maintaining a flexible and convenient programming interface,” Hazelcast CEO Greg Luck stated in introducing the Spark integration. “Without these functionalities, it becomes impossible to build enterprise applications which are expected to process more and more data.”

The in-memory specialist, based in Palo Alto, Calif., is therefore positioning its Spark entry as an open source alternative for boosting data storage and distributed computing for data streaming, machine learning and SQL workloads, all of which require fast iterative access to large datasets.

The Spark integration comes one year after Hazelcast released what it described as the “platform” version of its data grid that incorporates support for cloud management and application containers. The grid allows users to share and partition application data across installed clusters and servers.

The company also developed a sports betting application as a way of demonstrating the performance advantages of integrating Spark with its in-memory grid. The “bet engine” was designed to scale across multiple Java virtual machines with events shared across data grid partitions. The query engine used Spark to provide real-time risk and analytics. The combination of in-memory computing and distributed storage along with Spark’s query and analytics capabilities formed the basis for a future gaming application, the company claimed.
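As a rough illustration of events being shared across grid partitions, the sketch below routes bets to partitions by hashing the event name and then totals potential payouts per event. It is not taken from Hazelcast's demo code: the field names (`event`, `stake`, `odds`), the partition count and the payout formula (stake times decimal odds) are all assumptions made for this example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(bets, num_partitions=NUM_PARTITIONS):
    """Place each bet in the partition that owns its event."""
    partitions = [[] for _ in range(num_partitions)]
    for bet in bets:
        partitions[partition_for(bet["event"], num_partitions)].append(bet)
    return partitions


def exposure_by_event(partitions):
    """Sum potential payout (stake * decimal odds) per event."""
    totals = defaultdict(float)
    for part in partitions:
        for bet in part:
            totals[bet["event"]] += bet["stake"] * bet["odds"]
    return dict(totals)


bets = [
    {"event": "match-1", "stake": 10.0, "odds": 2.0},
    {"event": "match-2", "stake": 4.0, "odds": 1.5},
    {"event": "match-1", "stake": 5.0, "odds": 3.0},
]
partitions = route(bets)
exposure = exposure_by_event(partitions)
```

Because the hash depends only on the event name, every bet on the same event lands in the same partition, which is what allows a per-event total to be computed locally.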

The code for the sports betting application is here.

Hazelcast touts the interoperability of its in-memory data grid with a range of programming languages, including Java, Python, R and Scala, which are also supported by Spark. Hence, the company said the combination of Spark and its in-memory data grid could be used across stacks based on multiple programming languages.

The integration also underscores how platform developers focusing on big data applications are gradually shifting from current technology such as Hadoop and Storm to Apache Spark’s real-time streaming data capabilities. At the same time, Hazelcast claims its data grid boosts the in-memory performance of applications running in Hadoop clusters.

The company said the open source connector is shipping in version 3.7 of its in-memory data grid, which can then be used as a storage medium for Spark.
