Tristan Handy’s Audacious Vision of the Future of Data Engineering
Tristan Handy is a lot of things: co-creator of dbt, founder and CEO of dbt Labs, and self-described “startup person.” But besides leading dbt Labs to a $4 billion valuation, he is one more thing: an audacious dreamer of a better data future. But will his vision become reality?
The story of dbt’s rise is fascinating in several respects. For instance, dbt (short for “data build tool”) wasn’t originally intended to be used outside of Fishtown Analytics, the company Handy and his co-founders, Connor McArthur and Drew Banin, founded in 2016 before changing the name to dbt Labs in 2021. Handy and his co-founders developed an early version of dbt at RJMetrics before leaving and founding Fishtown Analytics to help early-stage tech companies prep their data in Amazon Redshift.
“We set out to build a consulting business and do fun work,” Handy tells Datanami in an interview this week at Coalesce 2023, dbt Labs’ user conference in San Diego. “It’s been a lot of learning at many different parts of the journey for me, because this is not what I thought that I was getting into.”
Handy had no idea how popular dbt would become, or that it would eventually open the doors to tackling some of the gnarliest problems in enterprise data engineering that have stymied some of the world’s biggest corporations for decades. But with 30,000 companies now using the open source data transformation tool and steady growth in revenue from the company’s enterprise offering, dbt Cloud, it’s clear that dbt has touched off a new movement. The question is: Where will it go?
dbt’s Early Days
“The initial idea was Terraform for Redshift,” Handy says, referring to HashiCorp’s infrastructure-as-code tool that enables developers to safely and predictably provision and manage infrastructure in the cloud. Handy and his team wanted a reusable template that could sit atop SQL to automate the tedious, time-consuming, and potentially hazardous aspects of data transformation.
Handy is not shy about stealing ideas from software engineers. (Imitation is the sincerest form of flattery, after all.) The maturation of Web development tools and the whole DevOps movement proved fertile ground for Handy and his team to borrow from, and those borrowed ideas have enhanced the field of data engineering.
“In data, we’re so scarred by having bad tooling for decades,” Handy says. “The way that this stuff plays out in software engineering is there’s this consistent layering of frameworks and programming languages on top of one another. When I started my career, if you wanted to build a Web application, you literally wrote raw HTML and CSS. There was nothing on top of it.
“But even as of 2010, you didn’t write raw HTML and CSS,” he continues. “You wrote Rails. Now you write React. You have these frameworks and the frameworks allow you to express higher-order concepts and not write as much boilerplate code. So the same thing that you would express in dbt, if you wrote the raw SQL for it, sometimes it’s double the length. Sometimes it’s 100 times the length. And the ability to be concise means there’s less code to maintain and you can move faster.”
A model is the core underlying asset that users create with dbt. Users write dbt code to describe the source or sources of data that will be the input, describe the transformation, and then output the data to a single table or view. Instead of deploying 100 data connectors to different endpoints in a data pipeline, as ETL tools will often do, a data transformation is defined once, and only once, in a dbt model. At runtime, a user can call a model or series of models to execute a transformation in a defined, declarative manner. This is a simpler approach that leaves less room for error.
“There’s these fundamental things in data engineering that everybody has to figure out how to do them, and the biggest thing is just things depend on other things,” Handy says. “SQL doesn’t have a concept of this thing depends on this thing, so run them in this order. From dbt’s very first version, it has the concept of these dependencies. That’s just one example, but there’s a million different examples of how that plays out.”
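The dependency-ordering idea Handy describes can be sketched in a few lines of Python. The model names and graph below are invented for illustration; in dbt itself, the edges come from `ref()` calls in each model’s SQL, but any tool that knows which models feed which can derive the run order rather than leaving it to the author:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models mapped to the models they select from.
deps = {
    "stg_orders": set(),                               # reads a raw source
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}

# Staging models come first; downstream models only run once
# everything they depend on has been built.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

This is the “run them in this order” concept that plain SQL lacks: the author states only what each model reads from, and the ordering falls out of the graph.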
A Rising Star
Soon after founding Fishtown Analytics (it’s named after the community in Philadelphia, Pennsylvania where the company was based), Handy started getting an inkling that dbt might be more than just a tool for internal use.
“Our first ever non-consulting client who used dbt was Casper,” Handy says. “We worked with them for a week. Then they said, ‘This thing is cool. We’re going to move all of our code into it.’ We’re like, that’s not what we expected. Currently it’s only us that use it.”
So the company instrumented dbt to count the number of organizations using the software, which was available under an Apache 2.0 license. In the first year, 100 companies were using dbt on a regular basis. From there, dbt adoption steadily rose by about 10% per month.
“It turns out that 10% month-over-month growth, if you keep at it for two years, it’s 10x,” Handy says. “So it was really about three years in that we’re like, this line very soon is going to hit 1,000 companies using dbt. At that point in time, we were a consulting business with 15 employees. We had three or four software engineers.”
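That back-of-the-envelope math checks out: 10% month-over-month growth compounded over 24 months is roughly a 10x multiple.

```python
# 10% month-over-month growth, compounded over two years (24 months).
multiple = 1.10 ** 24
print(round(multiple, 2))  # roughly 9.85, i.e. about 10x
```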
The business model had to change, so Handy started looking for investors. The company raised a $12.9 million Series A round led by Andreessen Horowitz in early 2020, followed by a $29.5 million Series B later that year. By that time, there were 3,000 dbt users globally and 490 customers paying for dbt Cloud, which it had launched the previous year.
Another funny thing happened in 2020: The cloud exploded. Thanks in part to the COVID-19 pandemic and the overall maturation of technology, companies flocked to stuff all their data in cloud data platforms. That correlated with a huge uptick in dbt use and paying customers. To keep up with the growth, dbt Labs raised more venture funds: $150 million in a Series C round in June 2021, followed by a $222 million Series D in March 2022 that valued the company at $4.2 billion.
Suddenly, instead of enabling data analysts at smaller firms to “become heroes” by doing the work of overworked data engineers, dbt Labs had a new type of customer: the Fortune 100 enterprise. This turned out to be a whole new kettle of fish for the folks from Fishtown.
New Data Challenges…
“We onboarded our first Fortune 100 customer three or three-and-a-half years ago,” Handy says from a fourth-story boardroom in the San Diego Hilton Bayfront. “It turns out that problems with data in the enterprise are, like, really significantly more complicated than the early adopter community. It turns out that the dbt workflow is very suitable to solve these problems, as long as we can adapt it in some different ways.”
The prototypical Fortune 100 corporation is a mish-mash of various teams of people speaking different languages, working on different technology platforms, and holding different data standards. Data integration has been a thorn in the side of large enterprises for decades, owing to the natural diversity of massive organizations assembled through M&A, and the subsidiaries’ natural resistance to homogenization.
Zhamak Dehghani has arguably done more than anyone to advance a solution to this problem with her concept of a data mesh. With the data mesh, Dehghani–who like Handy is a member of the Datanami People to Watch class of 2022–proposes that data teams can remain independent as long as they follow some principles of federated data governance.
dbt Mesh, which dbt Labs launched earlier this week at Coalesce, takes Dehghani’s ideas and implements them in the data transformation layer.
“We were very careful not to say ‘this is our data mesh solution,’ because Zhamak has very clear ideas of what data mesh is and what it isn’t,” Handy says. “I like Zhamak. She and I have gotten to know each other over the years. What I find in practice is that when I talk to data leaders, they love the description of the problem in data mesh. ‘Yes we absolutely have the problem that you’re describing.’ But they haven’t latched on to how do we solve this problem. And so what we’re trying to do is propose a very pragmatic solution to the problem that I think Zhamak pointed out very clearly.”
…And New Data Solutions
dbt Mesh enables teams of independent data analysts to do engineering work in a common project. If a team member tries to implement a data transformation that breaks one of the rules defined in dbt or breaks a dependency, the tool does something on the screen that is sure to get the user’s attention: it will not compile. This gets right to the heart of the problem in enterprise data engineering, Handy says.
“The problem in data engineering today is that something breaks, and because data pipelines are not constructed in a way that they’re modular, it means that this one thing actually breaks eight different connected pipelines, and it shows up in 18 different downstream dashboards. And you’re like, okay, then you have to figure out what actually broke,” Handy says.
“You spend four hours a day, whatever, trying to figure out what the root cause was. And then when you figure out what the root cause was, then you have to actually make that change in many different places and then verify. So the big point of dbt Mesh is that all of this stuff is connected, and …if a data set didn’t adhere to its contract, you didn’t wait to find out about it in production. You got it when you were writing that code. You didn’t get an alert in a dashboard. It’s like, no, you wrote code that doesn’t compile.”
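The “fail when you write the code, not in production” idea can be sketched independently of dbt. dbt implements it with model contracts declared in project YAML; the Python below is only an illustrative stand-in, with invented column names, showing how a build step can refuse to proceed the moment a model’s declared output disagrees with what downstream consumers have pinned:

```python
# A toy compile-time contract check, in the spirit of dbt Mesh's
# model contracts. Names and structures are invented for illustration.

# What downstream consumers expect the model to produce.
contract = {"order_id": "int", "revenue": "float"}

def check_contract(declared_columns: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the model 'compiles'."""
    violations = []
    for col, dtype in contract.items():
        if col not in declared_columns:
            violations.append(f"missing column: {col}")
        elif declared_columns[col] != dtype:
            violations.append(
                f"type mismatch on {col}: {declared_columns[col]} != {dtype}")
    return violations

# A change that silently renames a column is caught before anything runs.
ok = check_contract({"order_id": "int", "revenue": "float"}, contract)
broken = check_contract({"order_id": "int", "rev": "float"}, contract)
print(ok)      # no violations: the model "compiles"
print(broken)  # one violation: the renamed column breaks the contract
```

The design point is where the check runs: at build time, across every consumer of the model, rather than hours later in a dashboard.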
The point is not to build software or dbt models that are so pristine that nothing ever breaks. Everything will eventually have bugs in it, Handy says. But by borrowing concepts from the world of DevOps–where developers and administrators have closed the loop to accelerate problem detection and resolution–and merging them with Dehghani’s ideas of data mesh, Handy believes the field of data engineering can similarly be improved.
The end result is that Handy is genuinely optimistic about the future of data engineering. After years of suffering from substandard data engineering tools, there is a light at the end of the tunnel.
“You have people like you and me who have seen this story play out before,” he says. “And you talk to us and say, OK well, this is just the current wave of technology. What’s the next wave going to be? This is the modern data stack. What’s the post-modern data stack?”
The big breakthrough in 2020 was the rise of the cloud as the single repository for data. “The cloud means you can stop doing ETL. You can stop moving data around to transform it in some unscalable environment that’s hard to manage well. You just write some SQL,” Handy says.
“Previously you had these technology waves that crested and then fell and then everybody had to rebuild everything from scratch,” he continues. “But I think that we are actually just going to consistently make progress….Now it’s kind of moved through that period of hype. Now we’re just doing the thing, trying to get the work done. Folks are building more integrations. We’re solving enterprise problems that maybe are not as visible as stuff that’s going on in AI communities. But this is the work. This is the thing that people have tried to solve for three decades, and have not done it. And I think we’re actually going to do it this time.”