
LakeChime: A Data Trigger Service for Modern Data Lakes

May 2, 2024 — LinkedIn has announced the launch of LakeChime, a new data trigger service designed to streamline data management within modern data lakes. LakeChime addresses significant infrastructure challenges associated with managing large-scale data by offering a unified solution that simplifies data triggers across both traditional and modern table formats.

In this blog post, Walaa Eldin Moustafa, Senior Staff Software Engineer at LinkedIn, dives into the details of this new service.


Enriched by its vast array of members, skills, organizations, educational institutions, job listings, content, feed activity, and other interactions, the LinkedIn platform generates billions of data points. This data fuels a variety of applications, from recommendations and rankings to search, AI functionality, and other features that enhance the member experience. All of this information is stored in a massive data lake, which allows for quick, scalable, and efficient access to and processing of extensive datasets. However, that scale also poses significant infrastructure challenges in gathering, managing, and utilizing such an extensive data collection.

To that end, hundreds of thousands of data pipelines execute daily on our data lake, continually consuming data and writing back insights for further processing by downstream pipelines. Executing pipelines as soon as data is available is crucial for delivering timely insights, and this is facilitated by building data triggers that signal the availability of data.

These data trigger primitives are typically tied to a data lake’s metadata, since the metadata is most often updated once the data is ready to process. In turn, that metadata is shaped by the data lake’s table format. Therefore, the table formats used in the data lake play a significant role in determining data trigger primitives and semantics. In the next section, we discuss table formats, their history, and their relationship to data triggers in more detail.

The Evolution and Impact of Table Formats on Data Trigger Mechanisms

Table formats define the structure and organization of data within a data lake, specifying how data is stored, accessed, and managed. Until recently, the Apache Hive table format has been the format of choice for data lakes. In the Hive table format, data is organized into directories corresponding to partitions. The registration of partition metadata conventionally served as the signal of data arrival and the trigger for executing data pipelines. However, this convention suffered from significant gaps:

  • Coarse Granularity: Data consumption was constrained by the granularity of partition creation. For instance, if partitions were created daily, consumers could only schedule daily jobs to consume new partitions.
  • Partial Data Consumption: Partitions are created once, but the data within them can be updated continually. This forces data pipeline owners to choose between registering partitions late (maximizing the data visible at registration time but sacrificing latency) or registering them early (achieving low latency but sacrificing the completeness of the data available at registration time).
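
To make this convention concrete, the sketch below gates a downstream job on the registration of a daily Hive partition. It is a minimal illustration only, assuming Airflow 2.4+ with the Apache Hive provider installed; the table, partition column, and downstream job are hypothetical.

```python
# Minimal sketch of the "partition registration as trigger" convention.
# Assumes Airflow 2.4+ with apache-airflow-providers-apache-hive installed;
# table, partition column, and job names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor

with DAG(
    dag_id="daily_metrics_on_partition",
    start_date=datetime(2024, 5, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Block until the daily partition is registered in the Hive metastore.
    # Note the coarse granularity: the job can run only as often as the
    # producer creates partitions.
    wait_for_partition = HivePartitionSensor(
        task_id="wait_for_page_views_partition",
        table="tracking.page_views",
        partition="datepartition='{{ ds }}'",
        poke_interval=300,  # re-check every 5 minutes
    )

    compute_metrics = BashOperator(
        task_id="compute_daily_metrics",
        bash_command="echo 'run the daily metrics job for {{ ds }}'",
    )

    wait_for_partition >> compute_metrics
```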

As data lakes evolve towards modern table formats like Apache Iceberg, Delta Lake, and Apache Hudi, new metadata primitives are becoming mainstream. One breakthrough these table formats provide is ACID transactions and semantics: they introduce the concept of “snapshots” to express units of data change. Such units of change represent a much more granular level of metadata and address the partial data consumption gap of the Hive table format. However, some challenges remain as to:

  • How to present the abstraction of data triggers as a concept to the user, decoupling them from the underlying metadata representation differences, whether they are between Hive and the other formats, or among the other formats.
  • How to migrate a data lake that predominantly relies on Hive partition semantics for data triggers to one powered by modern table formats, whose data arrival semantics depend on snapshots.
  • How to handle the scale, throughput, and latency of such metadata requests in modern table formats. Hive metadata is served by a scalable MySQL backend, while most modern table formats store metadata as structured files (e.g., Avro or JSON) on a distributed file system.
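
To make the snapshot primitive concrete: modern formats expose their commit history directly as metadata. Apache Iceberg, for example, surfaces it through a `snapshots` metadata table that can be queried from Spark. The sketch below is illustrative only, assuming a Spark session already configured with the Iceberg runtime; the catalog and table names are hypothetical.

```python
# Sketch: listing Iceberg snapshots (the "units of data change" described above)
# via the `snapshots` metadata table. Assumes a Spark session configured with
# the Iceberg runtime; catalog and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-snapshots").getOrCreate()

# Every commit to an Iceberg table produces a snapshot; the metadata table
# records when each one was committed and which operation produced it.
snapshots = spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM my_catalog.tracking.page_views.snapshots
    ORDER BY committed_at DESC
""")
snapshots.show(truncate=False)
```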

Introducing LakeChime: A Unified Data Trigger Solution

In this blog post, we introduce LakeChime, a data trigger service that unifies data trigger semantics not only among modern table formats, but also between modern and traditional table formats such as Hive, bridging the impedance mismatch between traditional partition semantics and modern snapshot semantics.

At LinkedIn, we use LakeChime to support data triggers for Hive as well as Iceberg tables (maintained at LinkedIn by OpenHouse, the table catalog and control plane). Further, LakeChime is used as one of the main ways to streamline the migration from Hive to Iceberg through its data trigger compatibility layer.

LakeChime supports both types of data triggers: classical partition triggers, which fire workflows when partitions become available, and modern snapshot triggers, which fire workflows when new data snapshots become available. Further, LakeChime is powered by an RDBMS backend, making it well suited to handling large-scale data triggers in very large data lakes. Specifically, LakeChime unlocks the following use cases:

  • Backward Compatibility with Hive: LakeChime provides backward compatibility with Hive by supporting partition triggers for all table types, including modern table formats, at scale.
  • Forward Compatibility with Modern Table Formats: LakeChime offers forward compatibility with modern table formats by facilitating snapshot trigger semantics for all table types, including the Hive table format, at scale.
  • Simpler Data Lake Migrations: LakeChime is an essential component for migrating data lakes from the Hive table format to modern formats. It abstracts away the metadata implementation details and, through its forward and backward compatibility, provides a compatibility layer for the data trigger aspects of a migration.
  • Benefits of Snapshot Triggers: Snapshot triggers are a step up in UX compared to traditional partition triggers because they enable both low-latency computation and the ability to catch up on late data arrivals.
  • Incremental Compute: LakeChime unlocks incremental compute at scale when the underlying table format supports incremental scans, bridging the gap between batch and stream processing, and paving the path to smarter and more efficient compute workflows.
  • Ease of Integration: LakeChime is easily integrated with data producers, consumers, and data scheduling systems (e.g., Airflow or dbt) to trigger pipelines upon the availability of data.

In the following sections, we explore the inner workings of LakeChime, illustrating its integration with the popular scheduling platform, Airflow. We’ll also offer a comprehensive demonstration of the user experience, showcasing how LakeChime, Airflow, and Iceberg collectively facilitate incremental computing on Iceberg tables.
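
As a preview of the incremental compute pattern, the sketch below reads only the rows added between two Iceberg snapshots, the kind of delta a snapshot trigger makes addressable. It is illustrative only: the snapshot IDs would in practice come from the trigger payload (for example, surfaced to an Airflow task), and all names and values shown are hypothetical.

```python
# Sketch: incremental scan between two Iceberg snapshots using Spark.
# The snapshot IDs would normally be supplied by the data trigger and a
# consumer-side checkpoint; the values and table name here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-page-views").getOrCreate()

last_processed_snapshot_id = 5612871493218990570  # checkpointed by the consumer
new_snapshot_id = 6034992714332781114             # announced by the data trigger

# Iceberg's Spark reader supports incremental reads over append snapshots:
# start-snapshot-id is exclusive, end-snapshot-id is inclusive, so only the
# rows committed between the two snapshots are returned.
new_rows = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", str(last_processed_snapshot_id))
    .option("end-snapshot-id", str(new_snapshot_id))
    .load("my_catalog.tracking.page_views")
)

# Downstream logic processes only the delta instead of re-reading the table.
new_rows.groupBy("page_key").count().show()
```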

Data Change Event: The Foundation of LakeChime’s Data Trigger System

At the core of LakeChime’s data trigger system lies the Data Change Event, or DCE. DCEs capture the concept of data changes within a table and are registered by data producers upon updates. Data consumers, often orchestrated through frameworks like dbt or Airflow, consume these events to propagate changes downstream, emitting new DCEs in turn. Notably, data producers encompass a variety of platforms, including data ingestion platforms, compute engines, and table catalogs.
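
This excerpt does not enumerate the fields a DCE carries, but a rough mental model helps. The sketch below shows one hypothetical shape for such an event: every field and value is an assumption for illustration, not LakeChime’s actual schema.

```python
# Hypothetical sketch of what a Data Change Event (DCE) might carry.
# The actual LakeChime schema is not shown in this excerpt; every field here
# is an assumption made for illustration.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DataChangeEvent:
    table: str                  # fully qualified table name, e.g. "tracking.page_views"
    snapshot_id: Optional[int]  # set for snapshot-based formats such as Iceberg
    partition: Optional[str]    # set for partition-style (Hive) triggers
    committed_at: datetime      # when the producer committed the change
    producer: str               # e.g. ingestion platform, compute engine, or table catalog


# A producer would register an event like this after committing new data;
# consumers (e.g., Airflow DAGs) would trigger downstream work from it.
dce = DataChangeEvent(
    table="tracking.page_views",
    snapshot_id=6034992714332781114,
    partition=None,
    committed_at=datetime(2024, 5, 2, 12, 30),
    producer="ingestion-platform",
)
```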



Source: Walaa Eldin Moustafa, Datanami