March 4, 2022

Meet 2022 Datanami Person to Watch Ryan Blue

Alex Woodie

As the co-creator of Apache Iceberg, Ryan Blue played a central role in establishing the table data format as a new standard in the open data ecosystem. As the CEO of Tabular, Blue is also building a commercial entity around Iceberg. We recently caught up with Blue, who is one of our Datanami People to Watch for 2022.

Datanami: Apache Iceberg has filled a need for an open table format for a variety of computational frameworks, including Hive, Spark, Flink, PrestoDB, and Trino. What spurred you to develop it?

Ryan Blue: Before joining Netflix, I had a lot of conversations about fixing tables—it was a well-known problem and it seemed like each company I talked with had different approaches to making pipelines reliable. At Netflix, the problems were more urgent because we were working with data in S3 rather than HDFS. Directory listing couldn’t be trusted, latency was higher, and Netflix scale meant hitting “rare” problems all the time.

We started keeping track of just how many problems were caused by the simplicity of the Hive format and found that we could solve many pressing issues: the need to scale the Hive metastore, S3 latency, number of S3 operations, and S3 eventual consistency.

In the end, I think what pushed us to actually build it, rather than maintaining work-arounds, was that it was so painful for our data engineering partners. They’d regularly use a type that worked in only one engine, or drop a column and corrupt a table, or not know that to guarantee correctness Spark would automatically overwrite rather than insert. It was so painful to work with our platform that we had to do something.

The key was recognizing that our infrastructure problems and our customers’ pain had the same cause: a table format that wasn’t up to the task for data warehouse workloads.

Datanami: What do you really like about the open source community? Why is this the right way to develop software for enterprises?

Blue: The Iceberg community is full of amazing engineers and it’s been great to see the project grow far beyond what we would have been able to accomplish at Netflix alone. The list of contributions is really amazing. Things like SQL extensions to make it easy to run maintenance tasks or to configure a table’s sort order would never have happened, not to mention the integrations with all of the processing engines.

Of course, this was the goal of donating the project to the ASF. But it’s one thing to put a project out there and another to see people actually adopt it, and then to invest so heavily in improving it.

I’m glad to see it because this is what the larger big data community needs: a standard for cloud-native analytic tables that works across all the engines we already use. The only way to do that is through a healthy community that wants to welcome new people and use cases, and is neutral so everyone can confidently invest in support for the standard.

Datanami: What do you hope to see from the big data community in the coming year?

Blue: I’m excited to have more people using Tabular’s data platform, of course. But that aside, there are some things I think are set to make significant progress this year. The first is making data engineering more declarative. Even though we use SQL-like systems, people spend too much time worrying about how something is done instead of telling their tools what to do. I think this is one of the design principles that makes dbt so successful. This has been improving as SQL-like engines mature and I hope to see more improvements over the next year.

We’ve been working toward declarative data engineering in the Iceberg community for a long time with things like table-level configuration and hidden partitioning, but some features we added to Spark 3.2 make it more possible, like clustering and sorting as table attributes. It will be good to see people picking up those features and no longer worrying about rebuilding and testing jobs just to tweak the output clustering.

Along the same lines, there are some exciting developments in the view space. I’m hearing a lot more about materialized views lately. And there are some promising projects to be able to share views across database engines, like Substrait, which is a shared representation aimed at making it possible to exchange logical SQL plans. Having one definition work across Spark and Trino, for example, is a big win.

And the last thing is that I’m hoping to see more companies adopt Iceberg as the standard for analytic tables. In the last few months, Starburst, Dremio, Athena, EMR, and Snowflake have all announced support and I’m excited to see that momentum continue!

Datanami: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?

Blue: A few weeks into the pandemic, I started running every day to make sure I got out of the house and it turned into something I’ve kept doing every day. I’m at 650 days now, and I’m going to try to make it until the “end”. That’s hopefully soon, since we’re close to vaccines for kids under 5.

You can read the interview with Blue and other 2022 Datanami People to Watch winners at this link.

Applications: Enterprise Analytics

Technologies: Frameworks

Vendors: Tabular

Tags: Apache Iceberg, big data, Datanami Person to Watch, Ryan Blue
