TikTok Parent Open Sources Real-Time Data Warehouse
You might not yet be a major TikTok influencer, but you can still analyze data like TikTok’s parent company, ByteDance, which recently released its real-time data warehouse architecture as open source.
ByConity, the name of ByteDance’s data warehouse, is an elastically scalable, column-oriented relational database that’s based on ClickHouse, the scalable, open-source database that the Russian media giant Yandex created in 2009 and spun out into its own company in 2021. ByteDance adopted ClickHouse in 2018 to process batch and real-time data.
When the TikTok app took off globally later that year, the volume of data flowing into ClickHouse skyrocketed, leading to growing pains with the data warehouse. The main culprit, according to a May 24 blog post by ByConity maintainer Vini Jaiswal, was ClickHouse’s shared-nothing architecture, which prevented the company from scaling storage and compute independently and from ensuring consistently high query performance.
“ClickHouse’s tightly coupled architecture led to interactions among multiple tenants in a shared cluster environment,” Jaiswal wrote. “Since reading and writing operations were performed on the same node, they often interfered with each other, impacting overall performance.”
The company decided to upgrade the underlying architecture of ClickHouse and began the internal ByConity project in 2020. The core element separating ByConity from its predecessor was the implementation of centralized data storage, which allowed for the separation of compute and storage nodes in the cluster.
“This transformation results in stateless computing nodes, enabling dynamic expansion and contraction by leveraging the scalability of distributed storage and the stateless nature of computing nodes,” Jaiswal wrote.
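To make the idea of stateless compute nodes concrete, here is a minimal Python sketch. It is an illustration only, not ByConity code: the class names SharedStorage, ComputeNode, and Cluster are hypothetical, and the shared store is a plain in-memory dictionary standing in for a distributed storage layer.

```python
class SharedStorage:
    """Hypothetical stand-in for a centralized storage layer."""

    def __init__(self):
        self._tables = {}

    def write(self, table, rows):
        self._tables.setdefault(table, []).extend(rows)

    def read(self, table):
        return list(self._tables.get(table, []))


class ComputeNode:
    """Holds no table data of its own; every read goes to shared storage."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage

    def count(self, table):
        return len(self.storage.read(table))


class Cluster:
    """Hypothetical cluster whose compute size can change at any time."""

    def __init__(self, storage):
        self.storage = storage
        self.nodes = []

    def scale_to(self, n):
        # Nodes are stateless, so resizing moves no data between them.
        self.nodes = [ComputeNode(f"worker-{i}", self.storage) for i in range(n)]
```

Because the nodes keep no local data in this sketch, calling scale_to with a larger or smaller number changes only the compute side; every node still sees the same rows in the shared store.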
A byproduct of the separation of compute and storage is multi-tenant resource isolation, which enables a single ByConity implementation to be shared by multiple users without impacting performance. This makes it suitable for running in the cloud.
ByConity, which is developed in C++ (same as ClickHouse), delivers strong consistency of data read and write operations, Jaiswal wrote. “This ensures that data is always up-to-date and eliminates any inconsistencies between read and write operations, guaranteeing data integrity and accuracy,” she wrote.
The ByConity team adopted elements common to other OLAP engines, including column-oriented storage, vectorized execution, massively parallel processing (MPP) execution, and query optimization, according to Jaiswal. It selected FoundationDB, an open source key-value store owned by Apple, for storing metadata. Meanwhile, a virtualized approach to file storage allows ByConity users to adopt object storage like S3 or the Hadoop Distributed File System (HDFS) as the underlying storage mechanism.
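A virtualized storage layer of this kind is commonly built as a small interface with interchangeable backends. The Python sketch below assumes that pattern; it is not ByConity's actual API. The names FileStore, InMemoryStore, LocalDirStore, and save_part are hypothetical, and a local directory stands in for a real S3 or HDFS backend.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileStore(ABC):
    """Hypothetical storage interface the engine would code against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(FileStore):
    """Backend that keeps blobs in a dictionary (useful for tests)."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


class LocalDirStore(FileStore):
    """Backend that writes under a directory; stands in for S3 or HDFS."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()


def save_part(store: FileStore, table: str, part_id: int, data: bytes) -> str:
    # The caller never needs to know which backend is behind the interface.
    key = f"{table}/part_{part_id}.bin"
    store.put(key, data)
    return key
```

In this sketch, swapping the underlying storage means passing a different FileStore to save_part; the calling code stays the same.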
When a ByConity user submits a SQL query, it kicks off a series of processes inside the distributed database. The query passes through a query analyzer and a query optimizer, which produce a query plan that is either cost-based or rule-based. The plan is then handed to a scheduler, which consults the resource manager to determine which nodes will execute the query.
Worker nodes then execute the query according to the query plan. Queries may be routed to distinct computing resources, which helps to enforce multi-tenant isolation, Jaiswal wrote. The database adheres to the precepts of ACID for maintaining transactional integrity, she wrote.
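The query path described above (analyzer, optimizer, scheduler, resource manager, workers) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ByConity's implementation: every function and class name here is hypothetical, the analyzer only trims the query text, and the workers only report what they would run.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    sql: str
    strategy: str  # "cost" or "rules"


class ResourceManager:
    """Hypothetical registry mapping each tenant to its own worker nodes."""

    def __init__(self, groups):
        self.groups = groups  # tenant name -> list of worker names

    def workers_for(self, tenant):
        return self.groups[tenant]


def analyze(sql):
    # Stand-in for the query analyzer: normalize and reject empty queries.
    sql = sql.strip()
    if not sql:
        raise ValueError("empty query")
    return sql


def optimize(sql, use_cost_model=True):
    # Stand-in for the optimizer: pick cost-based or rule-based planning.
    return QueryPlan(sql=sql, strategy="cost" if use_cost_model else "rules")


def schedule(plan, tenant, resource_manager):
    # The scheduler asks the resource manager which nodes should run the plan.
    return [(worker, plan) for worker in resource_manager.workers_for(tenant)]


def run_query(sql, tenant, resource_manager, use_cost_model=True):
    plan = optimize(analyze(sql), use_cost_model)
    assignments = schedule(plan, tenant, resource_manager)
    # A real worker would execute its share of the plan; here it just reports it.
    return [f"{worker} runs [{p.strategy}] {p.sql}" for worker, p in assignments]
```

Giving each tenant its own list of workers in the ResourceManager is how this sketch models the multi-tenant isolation mentioned above: a query from one tenant is never scheduled onto another tenant's nodes.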
ByConity supports several deployment scenarios. Users can download binaries to run ByConity in a standalone Docker container, as a distributed cluster atop Kubernetes, or on physical machines. Users can also download the ByConity source code and compile it themselves.
You can download ByConity and access other open source resources at github.com/ByConity.