Real-Time Analytics Databases Emerge to Take On Big, Fast-Moving Data
A new product category is emerging in the analytics field to deliver timely queries atop very big and very fast-moving data. The name hasn’t been nailed down yet, but one of the leading providers in the space calls its product a real-time analytics database.
Once you’ve reached the limits of what a traditional data warehouse like Snowflake, BigQuery, or Redshift can do, you may step up into a more exotic line of distributed systems. The leaders in this space–Apache Druid, ClickHouse, and Apache Pinot–aren’t exactly new, but they are seeing a surge of interest as data volume and velocity continue to build and the window of opportunity to act on the data continues to shrink.
These databases are united not so much by the technology they use as by the capabilities they deliver. They all excel at executing complex OLAP-style SQL queries against very large amounts of fast-moving data, for a large number of users, and returning the results quickly (usually in under a second).
One of the people watching this space is David Wang, the vice president of product and technical marketing at Imply, the company behind Apache Druid. Wang says it’s been fun to see how Druid, ClickHouse, and Apache Pinot have competed in the emerging market for real-time analytics databases.
“I think that’s really exciting because everybody has always thought of analytics as BI and the classical executive style reporting and Tableau dashboards,” Wang told Datanami in a recent interview.
“But this whole new world of developers are building applications and they’re building analytics applications,” he said. “If you look at this category that we represent, it’s encompassing of Apache Druid, ClickHouse, Apache Pinot. There’s kind of a new wave of really fast, real-time analytic databases that are serving this new use case.”
The term “real-time” is vague and can have multiple meanings, Wang acknowledged. For example, it can refer to the pace at which new data is being generated, where it’s sometimes a synonym for streaming data. On the other hand, real-time can refer to the latency of the queries and the speed at which the user gets results. But it doesn’t really matter in the end, because Druid can check both of those boxes, Wang said.
“There is this intersection point on the Venn diagram when you’re trying to do real analytics, but do it at the speed, the concurrency, and the operational nature of events–then you’ve got to have something that’s purpose-built for that intersection, and I think that’s where this category has emerged,” he said.
A better way to think about real-time analytic databases like Druid is to consider the niche they fill. According to Wang, this new class of analytics database is serving an emerging need to analyze the massive amounts of fast-moving data generated by online applications.
Druid customers like Netflix, Target, and Cisco’s ThousandEyes have these types of fast-moving analytic problems. So does Sovrn, the ad-tech firm we recently profiled, which adopted a hosted version of Apache Pinot from StarTree. So does Yandex, the Russian search giant that developed ClickHouse and then spun it out into its own company in September 2021.
“Druid was built for the intersection of analytics and applications,” Wang said. “Analytics always represented large-scale aggregations and group-bys and big filtered queries, but applications always represented a workload that means high concurrency, operational data. It has to be really, really fast and interactive.”
ClickHouse, StarTree, and Imply may not have the same mindshare as Snowflake or Databricks. But among technologists who need established products to solve difficult analytics problems, they’ve already proven their worth. Expect to see more development in this emerging product category in the coming months and years.