Tabular Seeks to Remake Cloud Data Lakes in Iceberg’s Image
The creators of the Apache Iceberg table format launched a new company this summer called Tabular that aims to remake how companies store data in the cloud. If the company has its way, much of the minutiae of how data is stored in the data lake, as well as the ongoing maintenance and optimization of that data, will be automated, taking a large burden off data engineers and data analysts.
Iceberg is an open table format that was originally designed at Netflix and Apple to address the limitations of using Apache Hive tables to store and query massive data sets with multiple engines. Hive was originally built as a distributed SQL warehouse for Hadoop, but in many cases, companies continue to use Hive as a metastore even though they have stopped using it as a data warehouse.
The number one goal of Iceberg was to ensure the correctness of data, since Hive offered no such guarantees, which caused havoc when multiple services and engines accessed and modified Hive tables. But Iceberg brings other benefits too, including addressing the small file problem, simplifying ongoing maintenance of data, optimizing data access, and generally taking workload off the shoulders of overworked data engineers.
The goal with Tabular is to build a full data management service on top of Iceberg, says Ryan Blue, the co-creator of Iceberg and the co-founder and CEO of Tabular.
“What we’re thinking is sort of the data platform-level management that we were providing at Netflix, but for everyone,” Blue says. “Any company should be able to come and provision something that manages their data in the Iceberg format, in their bucket, and that works across any engine.”
As a senior engineer at Netflix, Blue created Iceberg with Dan Weeks, who was the engineering manager of big data compute at the streaming movie giant. The Iceberg tables were accessed by a variety of computational engines and services, including Presto, Trino, Spark, and Flink. Enabling that sort of compute-engine openness is the goal with Tabular, says Blue.
“We’re thinking that Tabular is going to be everything below the engine level–the metastore, the storage management, the services that maintain your data–all of those infrastructure components that are hard to build and maintain and run,” he tells Datanami. “Basically, the Netflix data platform without the compute layer, but as a hosted managed service.”
The Tabular team is working on the first prototype, and Blue doesn’t expect the service to become available until early 2022. It will be offered first as a hosted service on AWS, followed by availability on other cloud platforms, he says. The company, which completed its Series A round of funding from Andreessen Horowitz in July, is currently hiring.
The Iceberg table format is a good place to start when building a cloud-based data warehouse to house data in Parquet, ORC, and Avro formats. It provides the much-needed consistency to ensure that data doesn’t get out of whack. But it still requires data engineers to actively work with it and implement it, and that’s the element that Tabular is hoping to eliminate with its new service.
“I think of us as the bottom half of the database–that storage engine needs to keep track of what tables exist, where they are…is everything about that table,” Blue says. “We want to keep track of the data, keep track of how you’re using the data, and optimize that data for use across any number of these engines…whether you’re using Trino that you built yourself and you’re running in Kubernetes [or Spark on EMR]. We want to be that base layer that everything talks [with] to interact with your data.”
Before adopting Iceberg, Netflix leaned heavily on data engineers to construct and maintain the tables for downstream users. That required them to make a large number of decisions about the tables that impacted their usability, performance, and cost for Netflix, says Blue, who left the tech giant earlier this year to found Tabular alongside Weeks and Jason Reid, Netflix’s former director of data science and engineering.
“We had data engineers and we expected them to understand the tables that they were working with,” Blue says. “We’re exposing a lot of responsibility to data engineers there to understand all of those aspects. How is the table partitioned? What are my downstream consumers going to select on? And even things like what [sort] columns are going to make my data smaller? What’s a high cardinality column? All of those things should be something that we can get from the [Tabular] environment.”
DBA In a Box
Blue is taking the lessons he learned from Netflix’s approach to table management and seeking to automate, with the Tabular service, the functions that data engineers performed at Netflix. In a way, it’s like an automated database administrator (DBA).
“One thing that Iceberg does is we’re making more and more things table configuration,” Blue says. “So the sort order. How do I want to cluster my data? What size of files do I want? Those sorts of things, you essentially declare in Iceberg as, this is my ideal state. Everything is sorted like this. Everything is in this format using these settings. Then that gives us a target to shoot for.”
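The idea of declaring an ideal state that a background service then works toward can be sketched in a few lines of Python. This is not Iceberg’s actual API; the class and function names here are hypothetical, and the drift threshold is purely illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableTarget:
    """Hypothetical declared 'ideal state' for a table, in the spirit of
    Iceberg table configuration such as sort order and file size."""
    sort_columns: tuple      # e.g. ("event_date", "user_id")
    target_file_size: int    # desired data file size, in bytes

def needs_maintenance(target: TableTarget, actual_file_sizes: list) -> bool:
    """A background service compares the table's actual state to the declared
    target and schedules a rewrite when files are far smaller than intended."""
    avg = sum(actual_file_sizes) / len(actual_file_sizes)
    return avg < target.target_file_size / 4  # illustrative threshold

# Declare the ideal state once; the service has "a target to shoot for."
target = TableTarget(sort_columns=("event_date", "user_id"),
                     target_file_size=128 * 1024 * 1024)
```

With a declared target like this, the service, rather than a data engineer, decides when and how to bring the table back in line.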
For example, say a customer just wrote 10,000 files that are 5KB each into a table. “Well, that’s going to be terrible performance,” Blue says. “We can go apply your sort order and group your data correctly, rewrite it in the background fairly quickly, and make your operations more efficient without you having to have an expensive data engineer who understands how to make that happen in the first place.”
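The compaction Blue describes can be sketched as a simple bin-packing pass: group many tiny files into batches of roughly the target size, then rewrite each batch as one larger file. This is a minimal illustration, not how Iceberg’s rewrite procedures are actually implemented.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group small files into batches of roughly target_size bytes;
    each batch would be rewritten in the background as one larger file."""
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# The article's example: 10,000 files of 5 KB each, with a 128 MB target.
files = [5 * 1024] * 10_000
groups = plan_compaction(files, 128 * 1024 * 1024)
# All ~50 MB of data fits into a single rewritten file.
```

Ten thousand 5 KB files amount to only about 50 MB, so with a 128 MB target the whole table collapses into one file, which is why the rewrite can happen "fairly quickly" in the background.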
If a table doesn’t have an explicit sort order, the Tabular service will be able to infer a sort order based on the primary key of the table and the partitioning scheme, Blue says.
“We can also take a look at what are people actually doing in selecting from this table,” he says. “And if we know that, we can figure out, oh these are the columns that people tend to select by and we can fill that in.”
Eventually, Tabular could bring some AI to bear on the problem. For example, at Netflix, Blue helped implement a recommender system that would find the optimal settings for a given table by rewriting it 20 or so times and seeing which settings worked best. Tabular could eventually build that sort of system, but first the company is focused on getting the core service built and implemented.
“What we want is a very simple, easy, out-of-the-box solution that works great with Iceberg tables,” Blue says. “And if you’re either a new customer, someone moving to cloud, or someone with an existing Hive build out, to just be able to start using our service very easily.”
Open Data Lakes
The way Andreessen Horowitz’s Martin Casado describes it, Tabular is building “an independent cloud data platform.”
“It will replace raw data lakes with a service that hides much of the underlying complexity and automates common data management tasks,” Casado wrote in a recent blog post on the a16z website. “Tabular provides many of the features that make data warehouses easy to use — atomic transactions, schema evolution, time travel, partitioning, and so on — to any cloud-based data processing system that wants to support it, including data warehouses. In this sense, it implements the ‘lakehouse’ architectural pattern that is growing in popularity. But it adopts a fully open set of standards so that all systems can build on a common foundation and share data in a common format.”
First made popular with Hadoop, data lakes today are proliferating in the cloud, where they are implemented on S3 and other object storage systems. The combination of cheap storage and the separation of compute and storage means that companies can scale their data lakes well into the petabyte range. But there is complexity lurking in the lake that forces customers to become experts in “quirky limitations,” Tabular says.
The popularity of Snowflake and Databricks shows that there is a market for services that simplify data lake management. With Tabular, Blue hopes to mirror those types of services, while giving customers freedom to plug any engines into their big data storage.
“Snowflake has definitely built the bottom half of the database that is quite good,” Blue says. “Databricks has built the bottom half of the database as well, with their Delta Lake format.”
Both Databricks and Snowflake will run maintenance services on customers’ data, such as compacting data to minimize storage costs. However, both services aim to keep you within their respective ecosystems, Blue says.
“What we want is to be agnostic to the query engine,” he says. “We want both Databricks and Snowflake and Starburst and whoever to work with us and be able to access that data natively and with really great performance…We think that people want a lot of flexibility in the query engine, and without needing to move their data.”