Speedy Column-Store ClickHouse Spins Out from Yandex, Raises $50M
Russian search giant Yandex this week announced that it has spun out its distributed column-oriented analytic database ClickHouse into its own company. Based in New York City, ClickHouse Inc. also was given $50 million in Series A capital to jumpstart its business.
Moscow-based Yandex started developing the ClickHouse database in 2009, and it was put into service several years later as the OLAP backend for its Yandex.Metrica Web analytics service. The database’s main advantage was the ability to continually process large amounts of data at scale with relatively low latency, which continues to be a technical challenge facing organizations with big data requirements.
By storing data in pre-aggregated columns and using other techniques–including compression, vector calculations, and the capability to scale linearly, among others–ClickHouse was able to reach the upper echelons of performance. According to Yandex, ClickHouse is able to scan hundreds of millions of rows (representing tens of gigabytes) per second, which enables users to run SQL queries on petabyte-scale datasets with sub-second latencies. That is 100x to 1,000x faster than traditional databases, the company claims.
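To make the column-store idea concrete, here is a minimal sketch in Python. It is a toy illustration, not ClickHouse's implementation: the page-view records, the function names, and the use of run-length encoding as a stand-in for real compression codecs are all assumptions made for this example. It shows why a query that needs one field can skip the rest of each record when fields are stored column by column, and why a column of similar values tends to compress well.

```python
from array import array

# Hypothetical page-view records: (user_id, duration_ms, url).
rows = [
    (1, 120, "/home"),
    (2, 340, "/search"),
    (1, 90, "/home"),
    (3, 410, "/cart"),
]


def total_duration_rows(records):
    """Row-oriented scan: every whole record is visited to read one field."""
    return sum(record[1] for record in records)


# Column-oriented layout: each field is stored contiguously on its own.
columns = {
    "user_id": array("q", (r[0] for r in rows)),
    "duration_ms": array("q", (r[1] for r in rows)),
    "url": [r[2] for r in rows],
}


def total_duration_columns(cols):
    """Columnar scan: only the single column the query needs is read."""
    return sum(cols["duration_ms"])


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs.

    A simple stand-in for compression: sorted or repetitive columns
    shrink because neighboring values are often identical.
    """
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded
```

Both scan functions return the same total, but the columnar version never touches the `user_id` or `url` data. Calling `run_length_encode(sorted(columns["url"]))` shows how repeated values in a single column collapse into fewer entries.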
In a blog post, ClickHouse co-founder and CTO Alexey Milovidov, the original creator of ClickHouse, discussed the history of the database and the source of its technological advantage.
“The most notable advantage of ClickHouse is its extremely high query processing speed and data storage efficiency,” Milovidov wrote. “In previous generation data warehouses, you cannot run interactive queries without pre-aggregation; or you cannot insert new data in real time while serving interactive queries; or you cannot just store all your data. With ClickHouse, you can keep all records as long as you need and make interactive real-time reporting across the data.”
What’s the secret sauce that makes ClickHouse so fast? The “distinctive features” section of the ClickHouse website cites two key aspects of its advantage: like a “true” column-oriented database, it avoids storing extra values alongside the data, and it stores data sorted by primary key. (It’s also refreshing to see the company admit the disadvantages of its approach, including no full-fledged transactions and no support for updates, save for some batch update and delete functions for complying with GDPR.)
According to Milovidov, there isn’t one single thing. “…[T]here is no single ‘silver bullet,’” he writes. “The main advantage is attention to details of the most extreme production workloads.”
Soon after ClickHouse was implemented at Yandex.Metrica, it was adopted by much of Yandex, which is the largest Internet company in Europe with more than 14,000 employees. At that point, Milovidov says he knew the software needed to be more widely adopted.
“Maybe ClickHouse is too good to run only inside Yandex?” he wrote in the blog. “Doing open source is hard, but it is a big win. While it takes a tremendous effort and responsibility to maintain a popular open-source product, for us, the benefits outweigh all the costs.”
In 2016, Yandex released ClickHouse as open source under the Apache License 2.0. That led to exponential growth and adoption by thousands of companies around the world, including Uber, Comcast, eBay, and Cisco, according to Yandex.
Some of the customer adoption stories are compelling. For example, Uber adopted ClickHouse as its core logging platform for handling millions of logs per second from thousands of services, representing several petabytes of data in service. According to Uber’s February 2021 writeup, ClickHouse delivered a 10x performance boost over its ELK (Elasticsearch, Logstash, Kibana) implementation.
Spotify, meanwhile, used ClickHouse to power its A/B testing regimen in its Google Cloud-based log management system, which replaced a 2,500-node Hadoop cluster. The company needed to be able to run hundreds of queries per second, representing hundreds of billions of rows per day. In choosing ClickHouse over BigQuery, it cited the simplicity of architecture, a comprehensive set of built-in functions and aggregations, and Superset integration, among other reasons.
Deutsche Bank adopted ClickHouse as the basis for its data warehouse, which served a variety of use cases, including regulatory compliance, risk, trades, and know-your-customer initiatives. According to a presentation, the bank had tried multiple other databases including KDB+, Vertica, Hive, and Spark. Today it has settled on a combination of Spark, Alpakka, Kafka, Tableau, RShiny, and ClickHouse to power its queries.
“The diversity of ways companies are deploying ClickHouse is incredibly compelling and speaks volumes to the strength of the technology,” says ClickHouse co-founder and president of product and engineering Yury Izrailevsky, who left his job as vice president of engineering at Google and will lead product development at ClickHouse. “Forming ClickHouse Inc. will allow us to focus on making the product even more powerful, especially when deployed in cloud environments.”
Milovidov and Izrailevsky are joined by Silicon Valley veteran Aaron Katz, who is the CEO and a co-founder of the New York City company. Mike Volpi, Partner at Index Ventures, which co-led the round along with Benchmark, sees something in ClickHouse that reminds him of other high-flying technology firms.
“We’ve been early believers and investors in data infrastructure at Index, and have been fortunate to work with leaders like Elastic, Confluent, and Datadog since their early days,” Volpi says. “It is clear that ClickHouse has a similar exciting trajectory, given its impressive adoption and community buzz.”