Bank Replaces Hundreds of Spark Streaming Nodes with Kinetica
Kinetica got its start serving queries on fast-moving geospatial and temporal data for the Department of Defense. Now it’s moving into other industries with similar big data problems, including finance, where one large global bank replaced hundreds of nodes of Apache Spark with a handful of servers running Kinetica’s GPU-powered analytical database.
When it comes to serving queries on large amounts of fast-moving data, several challenges must be overcome. One of the biggest is keeping up with the extreme pace of data ingest, which generally rules out data warehouses, even the new generation of scale-out warehouses running in the cloud.
Customers generally have taken a roll-your-own approach to building real-time analytic systems from open source components. You will often see Apache Kafka as the underlying message bus, data transformation and aggregation occurring in a pipeline built with Apache Spark or Apache Flink, and a key-value store such as Redis or Cassandra used to serve metrics.
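The pattern described above can be sketched in a few lines. This is a hedged illustration, not any particular bank's code: the event shape and the `volume:` key naming are assumptions, and a plain dictionary stands in for the Redis/Cassandra serving layer, with `process_event` playing the role of the Spark or Flink aggregation job.

```python
from collections import defaultdict

# Stand-in for the key-value serving store (Redis/Cassandra in practice).
kv_store = defaultdict(float)

def process_event(event):
    """Aggregate one event into a per-symbol rollup, as the streaming
    job would, then write the metric to the KV store for serving."""
    key = f"volume:{event['symbol']}"   # illustrative key scheme
    kv_store[key] += event["qty"]

# Simulated micro-batch arriving from the message bus (Kafka in practice).
for ev in [{"symbol": "AAPL", "qty": 100},
           {"symbol": "AAPL", "qty": 50},
           {"symbol": "MSFT", "qty": 75}]:
    process_event(ev)

print(kv_store["volume:AAPL"])  # 150.0
```

The point Negahban makes later follows directly from this shape: any new metric or rollup means new aggregation code in the pipeline, because the KV store itself cannot compute it.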
Developers often prefer this approach because the open source frameworks and databases are well-established at this point, says Nima Negahban, the co-founder and CEO of Kinetica. “I understand that developers want full control, and that pipeline approach really puts it in their hands,” he says. “I think there’s a level of comfort there.”
But increasingly, customers are finding that the build-it-yourself approach with open source components is not giving them the results they want, Negahban says. Instead, with Kinetica, they're finding that they can get what they need without straying too far outside the comfortable confines of SQL and a relational database (albeit one designed to run in a distributed manner atop GPUs).
“Either people build out these elaborate Spark pipelines and dump them into a key value [store] like Cassandra or Redis, or they try to simply [process] things as they come in and do some rule evaluation then,” says Negahban, a 2018 Datanami Person to Watch. “With Cassandra and Redis, you’re doing all this ETL, manual fusion and enrichment and rollup.”
Many organizations that eventually become Kinetica customers start out trying to roll their own real-time analytics system before finding that approach is rife with complexity, inflexibility, and inefficiency, Negahban says.
“What we see a lot with folks who use that build-out approach is they build out their first round of capability, and when they want to add or when they want to develop management tools that sit on top of a KV [store], they realize that they basically start making like a bad database,” he says. “You start putting different database operations at different layers of your stack. If you want to make an activities table or activities monitor, but you’re doing everything through a KV, now you have to do that rollup logic in your microservice to feed to your Web app.”
Instead of building database-like management capabilities into their cobbled-together collection of frameworks after the fact, some organizations are discovering they may be better off just selecting an actual database from the beginning. Customers may give up a little bit of control, Negahban concedes, but in return they get to concentrate on building better customer applications instead of managing the data and the infrastructure.
“That’s the key difference: you don’t have to spend all this time building out these fixed pipelines to build out this fixed capability, because inevitably what happens is you realize you need to make more tools around that fixed capability, and then eventually you need to add more capital and it just becomes this never-ending cycle where you’re just spending more and more development time, when you really should be thinking about your user-facing capacity,” he says.
That’s basically what happened at a large Wall Street bank. According to Negahban, the bank needed to perform a temporal join against a high-speed stream of internal events. “They wanted to correlate events that were happening externally to events that were happening internally in their portfolio,” he explains.
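A temporal join of this kind can be sketched simply: for each internal event, find the most recent external event at or before its timestamp (an "as-of" join). This is an illustrative sketch only; the event fields and labels are invented, and the bank's actual join logic is not described in the source.

```python
import bisect

# (timestamp, label) pairs; the external stream must be sorted by timestamp.
external = [(1, "rate_cut"), (5, "earnings"), (9, "downgrade")]
internal = [(4, "buy AAPL"), (6, "sell MSFT"), (10, "hedge")]

ext_ts = [ts for ts, _ in external]

def asof_join(internal, external):
    """For each internal event, attach the latest external event
    whose timestamp is <= the internal event's timestamp."""
    joined = []
    for ts, action in internal:
        i = bisect.bisect_right(ext_ts, ts) - 1  # latest external event at or before ts
        label = external[i][1] if i >= 0 else None
        joined.append((ts, action, label))
    return joined

print(asof_join(internal, external))
# [(4, 'buy AAPL', 'rate_cut'), (6, 'sell MSFT', 'earnings'), (10, 'hedge', 'downgrade')]
```

Doing this continuously over a high-speed stream, rather than over two static lists, is what drove the bank's original Spark-plus-Redis buildout.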
The bank selected a streaming data framework with which to execute that high-speed join, pairing it with the Redis key-value store. While the bank got the setup to work, it was inefficient, and the bank ultimately replaced it with Kinetica.
“We took out 700 nodes of Spark and Redis,” Negahban says. “They were spending tons and tons of money on hardware and hundreds of thousands, millions of lines of Spark-Flink type code. We were able to reduce that to several thousand lines of SQL.”
The key that makes it work is the speed of Kinetica’s vectorized database engine. Negahban and his co-founder, Amit Vij, developed their own custom join and group-by kernels to leverage the power of Nvidia GPUs. The database has since been adapted to run on Intel’s AVX-512 instruction set, which eliminates the need for a GPU. But for some high-cardinality data, which Kinetica’s largest customers have in abundance, the GPU version remains the favorite.
“What it provides you is that ability to do that raw query processing that’s so important when you have data constantly flowing in, when you have more complex queries or high-cardinality group-bys or ad-hoc joins, which more and more we’re seeing the demand for,” Negahban says. “That’s why us having this very powerful brute-force engine really pays off, because we can do the more traditional stuff, but we also have the muscle in the back to blow through compute-intensive query operations.”
Customers can also use Kinetica as a key-value store paired with Flink or Spark Streaming, whereby the frameworks take the first pass at aggregating and shaping the data before it’s piped into Kinetica’s relational database to serve real-time metrics, Negahban says. But the better approach, he says, is to leverage the full capabilities of Kinetica without the extra step of building Spark Streaming or Flink pipelines.
“On top of that, you have full ad hoc queries,” he continues. “So where we excel is doing really complex OLAP as data continues to stream into your tables. So you don’t have to coordinate or micro batch or worry about any of that. You get that robust OLAP capability out of the box, and you also get that in tandem with distributed ingress and distributed egress.”
Kinetica is currently growing in excess of 30% per year, according to Negahban. In addition to customers in financial services and government, it’s also gaining share among retailers and distributors, who are looking for better ways to manage their inventory in real time.
The Arlington, Virginia company recently bolstered its integration with Apache Kafka with the development of a native connector, eliminating the need for users to write their own. In 2022, the company will be focused on delivering a serverless offering running in the cloud, and an AWS Marketplace application for commercial and government cloud customers.