Iceberg Data Services Emerge from Tabular, Dremio
Data professionals with plans to build lakehouses atop the Apache Iceberg table format have two new Iceberg services to choose from, including one from Tabular, the company founded by Iceberg’s co-creator, and another from Dremio, the query engine developer that is holding its Subsurface 2023 conference this week.
Apache Iceberg has emerged as one of the core technologies upon which to build a data lakehouse, in which the scalability and flexibility of data lakes are merged with the data governance, predictability, and proper SQL behavior associated with traditional data warehouses.
Originally created by engineers at Netflix and Apple to deal with data consistency issues in Hadoop clusters, among other problems, Iceberg is emerging as a de facto data storage standard for open data lakehouses that work with all analytics engines, including open source offerings like Trino, Presto, Dremio, Spark, and Flink, as well as commercial offerings from Snowflake, Starburst, Google Cloud, and AWS.
Ryan Blue, who co-created Iceberg while at Netflix, founded Tabular in 2021 to build a cloud storage service around the Iceberg core. Tabular has been in a private beta for a while now, but today the company announced that it is now open for business with its Iceberg service.
According to Blue, the new Tabular service basically works as a universal table store running in AWS. “It manages Iceberg tables in a customer’s S3 bucket and allows you to connect up any of the compute engines that you want to use with that data,” he says. “It comes with the catalog you need to track what tables and metadata are there, and it comes with integrated RBAC security and access controls.”
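Attaching a compute engine to a catalog-managed set of Iceberg tables typically comes down to a handful of session settings. A minimal sketch of what that looks like in open source Spark, using Iceberg's REST catalog client (the catalog name, endpoint URI, and bucket below are placeholders, not Tabular's actual values):

```python
# Illustrative Spark settings for pointing an engine at Iceberg tables
# managed by a REST catalog over S3. The "lakehouse" catalog name, the
# URI, and the bucket are placeholders -- a real service would supply
# its own endpoint and credentials.
iceberg_catalog_conf = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    # Register a Spark catalog named "lakehouse" backed by Iceberg
    "spark.sql.catalog.lakehouse": "org.apache.iceberg.spark.SparkCatalog",
    # Use Iceberg's REST catalog client to talk to the managed service
    "spark.sql.catalog.lakehouse.catalog-impl":
        "org.apache.iceberg.rest.RESTCatalog",
    "spark.sql.catalog.lakehouse.uri": "https://catalog.example.com",
    "spark.sql.catalog.lakehouse.warehouse": "s3://example-bucket/warehouse",
}
```

With those settings applied to a SparkSession, tables become addressable as `lakehouse.<database>.<table>` from plain SQL, and the same catalog endpoint can be shared by other engines.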
In addition to bulk and streaming data load options, Tabular provides automated management tasks for maintaining the lakehouse going forward, including compaction. According to Blue, Tabular’s compaction routines can shrink the size of customers’ Parquet files by up to 50%.
“Iceberg was the foundation for all of this and now we’re just building on top of that foundation,” says Blue, a Datanami 2022 Person to Watch. “It’s a matter of being able to detect that someone wrote 1,000 small files and clean them up for them if they’re using our compaction service, rather than relying on people, data engineers in particular, who are expected to not write a thousand small files into a table, or not write pipelines that are wasteful.”
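Open source Iceberg already exposes the underlying mechanism here as a stored procedure, `rewrite_data_files`, which coalesces many small data files into fewer large ones; Tabular's contribution is running it automatically. A sketch of invoking it by hand (the catalog and table names are illustrative, and Tabular's managed service does not require issuing this yourself):

```python
def compaction_sql(catalog: str, table: str, target_mb: int = 512) -> str:
    """Build the Iceberg stored-procedure call that rewrites small data
    files in `table` into files of roughly `target_mb` megabytes each."""
    target_bytes = target_mb * 1024 * 1024
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_bytes}'))"
    )

# A Spark session attached to the catalog would then run, e.g.:
#   spark.sql(compaction_sql("lakehouse", "analytics.events"))
```

The point Blue makes is that nobody should have to remember to run this: the service detects the thousand small files and compacts them on the engineer's behalf.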
Tabular built its own metastore, sometimes called a catalog, which is necessary for tracking the metadata used by the various underlying compute engines. Tabular’s metastore is based on a distributed database engine, and scales better than the Apache Hive metastore, Blue says. “We’re also targeting a lot better features than what’s provided by the Hive metastore or wire-compatible Hive metastores like Glue,” he says.
Tabular’s service will also protect against the ramifications of accidentally dropping a table from the lakehouse. “It’s really easy to be in the wrong database, to drop a table, and then realize, uh oh, I’m going to break a production pipeline with what I just did!” Blue says. “How do I quickly go and restore that? Well, there is no way in Hive metastore to quickly restore a table that you’ve dropped. What we’ve done is we’ve built a way to just keep track of dropped tables and clean them up… That way, you can go and undrop a table.”
Blue, who spoke today during Dremio’s Subsurface event and timed the launch of Tabular to the event, describes Tabular as the bottom half of a data warehouse. Users get to decide for themselves what analytical engine or engines they use to populate the upper half of the warehouse, or lakehouse.
“We’re purposefully going after the storage side of the data warehouse rather than the compute side, because there’s a lot of great compute engines out there. There’s Trino, Snowflake, Spark, Dremio, Cloudera’s suite of tools. There’s a lot of things that are good at various pieces of this. We want all of those to be able to interoperate with one central repository of tables that make up your analytical data sets. We don’t want to provide any one of those. And we actually think it’s important that we separate the compute from the storage at the vendor level.”
Users can get started with the Tabular service for free, and are free to use it until the 1TB limit is hit. Blue says that should give testers enough time to familiarize themselves with the service, see how it works with their data, and “fall in love” with the product. “Up to 1TB we’re managing for free,” he says. “Once you get there we have base, professional, and enterprise plans.”
Tabular is available only on AWS today. For more information see www.tabular.io and Blue’s blog post from today.
Dremio Discusses Arctic
Meanwhile, Dremio is also embracing Iceberg as a core component of its data stack, and today during the first day of its Subsurface 2023 conference, it discussed a new Iceberg-based offering dubbed Dremio Arctic.
Arctic is a data storage offering from Dremio that’s built atop Iceberg and available on AWS. The offering brings its own metadata catalog that can work with an array of analytic engines, including Dremio, Spark, and Presto, among others, along with automated routines for cleaning up, or “vacuuming” Iceberg tables.
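The "vacuuming" Arctic automates corresponds to maintenance that open source Iceberg exposes as stored procedures, most notably snapshot expiration, which removes old table snapshots and the data files only they reference. A sketch of the manual equivalent (catalog and table names are illustrative; Arctic's managed routines need not use this exact interface):

```python
def expire_snapshots_sql(catalog: str, table: str, older_than: str) -> str:
    """Build Iceberg's snapshot-expiration call, the manual equivalent
    of vacuuming stale snapshots (and their orphaned data files) from
    `table`. `older_than` is a SQL timestamp literal cutoff."""
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{older_than}')"
    )

# e.g. spark.sql(expire_snapshots_sql(
#          "lakehouse", "analytics.events", "2023-02-01 00:00:00"))
```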
Arctic also provides fine-grained access control and data governance, according to Tomer Shiran, Dremio’s founder and chief product officer.
“You can see exactly who changed what, in what table and when, down to the level of what SQL command has changed this table in the last week,” Shiran says, “or was there a Spark job and what is the ID that changed the data. And you can see all the history of every single table in the system.”
Arctic also enables another feature that Dremio calls “data as code.” Just as Git is used to manage source code for computer programs and enable users to easily roll back to previous versions, Iceberg (via Arctic) can enable data professionals to work more easily with data.
Shiran says he’s very excited about the potential for data as code within Arctic. He says there are a variety of obvious use cases for treating data as code, including ensuring the quality of ETL pipelines by using “branching”; enabling experimentation by data scientists and analysts; delivering reproducibility for data science models; recovering from mistakes; and troubleshooting.
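The ETL-branching use case resembles the Git-style commands exposed by open source Iceberg catalogs such as Project Nessie, whose Spark SQL extensions let a pipeline write to an isolated branch and publish atomically. A sketch of that flow (branch, catalog, and table names are illustrative, and Arctic's exact syntax may differ):

```python
# Git-style "data as code" flow, in the style of Nessie's Spark SQL
# extensions: isolate an ETL run on a branch, validate, then merge.
# All names here are hypothetical examples.
etl_branch_flow = [
    "CREATE BRANCH etl_run FROM main IN lakehouse",   # isolate the pipeline
    "USE REFERENCE etl_run IN lakehouse",             # writes land on the branch
    "INSERT INTO lakehouse.analytics.events SELECT * FROM staging_events",
    # ...data-quality checks run against the branch here...
    "MERGE BRANCH etl_run INTO main IN lakehouse",    # publish atomically
]
```

Readers on `main` never see half-loaded data; if a quality check fails, the branch is simply dropped instead of merged, which is the "recovering from mistakes" case Shiran describes.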
“At Dremio, in terms of our product and technology, we’ve worked very hard to make Apache Iceberg easy,” Shiran says. “You don’t really need to understand any of the technology.”
Subsurface 2023 continues on Thursday, March 2. Registration is free at www.dremio.com/subsurface/live/winter2023.