Aerospike Turbocharges Spark ML Training with Pushdown Processing
Companies that need to access a lot of data in a hurry, for example to retrain a machine learning model in Spark, have traditionally had to move that data from the edge to a central repository, such as a cloud data lake. But with the new pushdown processing capability in Aerospike’s NoSQL database, companies can access hundreds of terabytes stored in Aerospike from Apache Spark in just a few hours, enabling them to retrain their ML models several times per day.
Aerospike founder and Chief Product Officer Srini Srinivasan gave Datanami the lowdown on the high tech in version 5.6 of its NoSQL database, which the company is unveiling today during its annual user conference (still virtual due to COVID).
It all starts with sets, which are logical groupings of data within a single Aerospike namespace. Support for indexing of sets dramatically lowers the amount of data required to be scanned during a query. When a set is indexed, the database engine doesn’t have to scan the entire namespace, which could be over a petabyte.
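To make the effect of a set index concrete, here is a minimal Python sketch built on a toy in-memory namespace. It does not use the Aerospike client or reflect how the database stores data internally; the `ToyNamespace` class, its methods, and the set names are hypothetical and only illustrate why an index keyed by set name avoids visiting every record in the namespace.

```python
from collections import defaultdict


class ToyNamespace:
    """Hypothetical in-memory stand-in for a namespace; not Aerospike's API."""

    def __init__(self):
        self.records = []                   # every record in the namespace
        self.set_index = defaultdict(list)  # set name -> positions in self.records

    def put(self, set_name, value):
        # Record the position in the per-set index, then store the record.
        self.set_index[set_name].append(len(self.records))
        self.records.append((set_name, value))

    def scan_without_index(self, set_name):
        """Visit every record in the namespace and keep those in the set."""
        visited = 0
        matches = []
        for name, value in self.records:
            visited += 1
            if name == set_name:
                matches.append(value)
        return matches, visited

    def scan_with_index(self, set_name):
        """Visit only the records the set index points to."""
        positions = self.set_index.get(set_name, [])
        return [self.records[i][1] for i in positions], len(positions)


ns = ToyNamespace()
for i in range(100_000):
    ns.put("clicks", i)        # one large set dominates the namespace
for i in range(100):
    ns.put("fraud_labels", i)  # one small set we actually want to read

full_result, full_visited = ns.scan_without_index("fraud_labels")
idx_result, idx_visited = ns.scan_with_index("fraud_labels")
print(f"visited without index: {full_visited}, with index: {idx_visited}")
```

Both scans return the same records; the difference is how many records each one has to touch, which is the gap Srinivasan describes between a small set and a very large namespace.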
“There are millions of items in a set, compared to hundreds of billions in the namespace,” Srinivasan says. “You would reduce the time it takes to access the set.”
Secondly, version 5.6 adds support for read and write operations on expressions, which act like filters that further reduce the amount of data the query engine needs to search to find the answers.
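The idea of an expression acting as a filter inside the database can be sketched in plain Python. This is an illustration under stated assumptions, not Aerospike's expression API: `ToyServer`, its `scan` method, and the `is_large` predicate are hypothetical names, and the "server" is just a list in memory. The sketch compares filtering after the data has been shipped to the client with filtering before it leaves the server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, int]


@dataclass
class ToyServer:
    """Hypothetical database server that counts how many records it sends."""
    records: List[Record]
    records_sent: int = 0

    def scan(self, predicate: Optional[Callable[[Record], bool]] = None) -> List[Record]:
        # With a predicate, filtering happens here, before anything is "sent".
        if predicate is None:
            matched = list(self.records)
        else:
            matched = [r for r in self.records if predicate(r)]
        self.records_sent += len(matched)
        return matched


def is_large(record: Record) -> bool:
    """Illustrative filter: keep records with an amount above 900."""
    return record["amount"] > 900


records = [{"id": i, "amount": i % 1000} for i in range(10_000)]

# Client-side filtering: the server ships everything, the client discards most of it.
client_side_server = ToyServer(records)
client_side = [r for r in client_side_server.scan() if is_large(r)]

# Pushdown: the filter travels to the server, which ships only the matches.
pushdown_server = ToyServer(records)
pushed_down = pushdown_server.scan(is_large)

print(client_side_server.records_sent, pushdown_server.records_sent)
```

The two approaches produce identical results, but the pushdown version moves far fewer records across the wire, which is the saving the article attributes to operations on expressions.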
The combination of the set indexes and operations in expressions results in a powerful new capability that allows users essentially to do pushdown processing of data from external systems into the Aerospike database, Srinivasan says.
“We have increased the scale you can handle in the database, because now you can handle a lot more sets, and you can run a lot more set-oriented queries,” he says. “We don’t have to go look at all the data to find individual sets, so that reduces the resources in the database.”
Prior to this release, users would have written user-defined functions (UDFs) in the Lua language to access data remotely in this manner. With support for operations on expressions, those queries are not only moved closer to the database, but they’re executed as C code, which speeds up processing.
The speedup in data processing for large databases is substantial, Srinivasan says.
“Before we could have run maybe tens of queries per second,” he says. “You can probably run thousands now. It’s a couple of orders of magnitude improvement when you have small sets in a large namespace in terms of how you can run it.”
One of the most obvious use cases for this new functionality is retraining a machine learning model in Spark using data collected in Aerospike. The database already offered a connector for Spark version 3.0. With the new pushdown query advancements, Spark is now able to access a lot more data from Aerospike.
“It’s about scale,” Srinivasan says. “When you run a Spark process with processing logic…you’re able to have N-way parallelism. Spark allows up to 32,000 worker threads. You can align that with set scans in Aerospike, which means you’re now able to scan it faster because each one of those set scans is more efficient now, a couple of orders of magnitude faster.”
This will bring a dramatic speedup for Spark customers, according to Srinivasan. A machine learning job in Spark that needs to access 100 TB of data residing in Aerospike can now complete that job in just a couple of hours, he says.
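For a sense of scale, the figures quoted here can be turned into a rough throughput estimate. The short calculation below is only back-of-the-envelope arithmetic under explicit assumptions: "a couple of hours" is taken as exactly two hours, a terabyte as 10^12 bytes, and the load is assumed to spread evenly across the 32,000 Spark worker threads Srinivasan mentions. The function name `required_throughput` is hypothetical.

```python
def required_throughput(total_bytes, seconds, workers):
    """Return (aggregate bytes/s, bytes/s per worker) for an evenly spread scan."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / workers


TB = 10**12                 # assumption: decimal terabytes
DATA_BYTES = 100 * TB       # 100 TB, as quoted in the article
DURATION_S = 2 * 3600       # assumption: "a couple of hours" means two hours
WORKERS = 32_000            # worker-thread count quoted by Srinivasan

aggregate_bps, per_worker_bps = required_throughput(DATA_BYTES, DURATION_S, WORKERS)
print(f"{aggregate_bps / 1e9:.1f} GB/s aggregate")
print(f"{per_worker_bps / 1e3:.0f} KB/s per worker")
```

Under those assumptions the job needs roughly 13.9 GB/s in aggregate, or about 434 KB/s per worker thread, which shows why spreading the scan across many parallel readers matters more than the speed of any single one.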
“You can actually process more data than fits in memory in Spark because Aerospike is going to provide the data as you scan it,” he explains. “Now this scan can happen in parallel across 32,000 Spark worker threads going against 32,000 sub-scans inside Aerospike.”
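The pattern Srinivasan describes, many workers each consuming its own sub-scan as a stream, can be sketched with Python's standard library. This is not the Aerospike Spark connector and does not reproduce its partitioning scheme; `split_range`, `sub_scan`, and `worker` are hypothetical helpers, the data is synthetic, and 32 threads stand in for the much larger count quoted above. Python threads also do not run CPU-bound work in parallel, so the sketch shows the structure of the approach rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(total, parts):
    """Split [0, total) into `parts` contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def sub_scan(start, stop):
    """Hypothetical sub-scan: yields records one at a time, never a full list."""
    for key in range(start, stop):
        yield {"key": key, "value": key % 7}


def worker(bounds):
    """Consume one sub-scan as a stream, keeping only a running sum in memory."""
    total = 0
    for record in sub_scan(*bounds):
        total += record["value"]
    return total


NUM_RECORDS = 100_000
NUM_WORKERS = 32  # stand-in for the far larger thread count in the article

ranges = split_range(NUM_RECORDS, NUM_WORKERS)
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    partial_sums = list(pool.map(worker, ranges))

parallel_total = sum(partial_sums)
print(parallel_total)
```

Because each worker holds only its running total, the full dataset never has to sit in memory at once, which mirrors the point about processing more data than fits in memory in Spark.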
This is a potential game-changer for edge analytics use cases, where customers are ingesting large amounts of new data and processing data updates in the Aerospike NoSQL database. Thanks to the database’s cross-datacenter replication (XDR) functionality, data in a remote Aerospike database can be quickly replicated to a larger database cluster running on-prem or in the cloud. This cluster, when integrated with a large Spark cluster running alongside it, can now keep that Spark cluster fed at a rate that was previously not possible.
“The good news is, as data is changed on the edge, more and more data keeps appearing in the system of record using XDR,” Srinivasan says. “Which means, if you’re generating a model every few hours, it can take advantage of the latest data in Aerospike without having to copy it elsewhere.”
Some folks may think that databases are archaic remnants of a bygone IT era, unable to keep up with current big data demands. But the folks at Aerospike would beg to differ, as they have taken steps to keep their database management system running as fast as possible.
In addition to IoT use cases, Aerospike customers in financial services could benefit from this new capability. Fraud detection is one use case. Another potential customer is a brokerage house that can now update its risk analysis models on a more frequent basis, enabling it to better understand its current positions and to take advantage of sudden changes in the market.
Pushdown processing will benefit customers using Aerospike in conjunction with Spark. It will also benefit customers using it with Presto.
“The basic concept is pushing down the query into Aerospike from an external system. We can do it from many systems,” Srinivasan says. “Presto runs distributed queries, so running a query, Aerospike can now participate in that query with set indexes.”
Many Presto customers are querying data in S3 or other data lake systems. But for Aerospike customers, that would require moving data from their system of record into S3 to take advantage of query engines like Presto (or Spark). Eliminating that costly data movement is what Aerospike 5.6 is all about.
“Virtually every Aerospike customer uses Spark. Many of them use Presto,” Srinivasan says. “Typically what they have to do now is copy the data into S3 to use it.
“Now we’re saying, you don’t have to do any of that,” Srinivasan says. “You just use it directly from Aerospike. And it’s fast enough that you won’t even see the impact on it from your system of record.”
Aerospike is currently running its Aerospike Digital Summit 2021 this week. For more information, see https://aerospike.com/summit/.