Starburst Backs Data Mesh Architecture
The emerging data mesh architecture has the potential to keep AI and analytics projects moving forward even as data storage and processing continue to disperse far and wide. One independent backer of the data mesh concept is Starburst, the company behind Trino, the fork of the distributed SQL query engine Presto.
Starburst CEO Justin Borgman says there is considerable momentum behind the data mesh concept.
“It seems to be gaining some momentum,” Borgman tells Datanami. “Basically, it’s an acknowledgement that data will be decentralized and that there are advantages to being decentralized, and that really what we’re trying to produce is a single point of access or single point of analytics across all that data regardless of where it lives.”
On its website, Starburst is positioning Trino (formerly PrestoSQL) as “the analytics engine for data mesh.” Trino, like Presto, is a distributed engine that can execute SQL queries against data stored in a range of databases and file systems. It was originally designed to work in Facebook’s modified Hadoop cluster, but today the biggest use cases arguably are querying data stored in S3 or S3-compatible object storage systems, as well as lakehouses such as Databricks’ Delta Lake.
“[Michael] Stonebraker famously said there’s no one-size-fits-all database, and that logically means you’re going to have a lot of different databases within your organization, and that those teams probably know that data the best,” Borgman says. “Each one is sort of domain specific in that regard. And so those teams would have their own data engineers that manage that data, but are kind of stitched together by this fabric, or this data mesh, and that’s where we come into play, by allowing you to see across all of those data sources.”
Cloud-based data lakes are the biggest repositories of data today, but they’re not the only place where data lives. By following the precepts of the data mesh architecture, Starburst aims to unify data analytics across distributed domains on behalf of its customers, including Comcast.
“Comcast is a great example,” Borgman said. “Back three years ago, their initial use case was basically Teradata and Hadoop. [Comcast said] we need to get access to both. We have viewing behavior in the data lake, like what shows people watch, and we’ve got billing data in Teradata. We want to be able to understand how the shows that people watch impact how much they spend with us, and do cross sell and upsell campaigns off those two datasets.”
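A federated query of that kind might look like the following sketch in Trino SQL. The catalog, schema, table, and column names here (`hive`, `teradata`, `viewing.events`, `billing.accounts`, and so on) are hypothetical illustrations, not Comcast's actual schema; the point is that a single SQL statement can join data living in a Teradata system with data living in a lake, each exposed to Trino as a separate catalog:

```sql
-- Hypothetical federated query: join viewing behavior in the data lake
-- (exposed through a Hive-style catalog over object storage) with billing
-- data living in Teradata, each registered as a separate Trino catalog.
SELECT v.show_name,
       AVG(b.monthly_spend) AS avg_spend
FROM hive.viewing.events AS v          -- data lake catalog
JOIN teradata.billing.accounts AS b    -- Teradata catalog
  ON v.account_id = b.account_id
GROUP BY v.show_name
ORDER BY avg_spend DESC;
```

Trino pushes down what it can to each underlying system and performs the cross-source join itself, which is what lets the two datasets stay where they are.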
Comcast uses Trino as part of its “query fabric” that unifies data analytics activity across different domains, query engines, and storage repositories. The phrase query fabric basically means the same thing as data fabric, Borgman says. “Everybody has a different term,” he says.
Big multinational companies may be on the leading edge of the data mesh movement for one simple reason: data sovereignty and privacy regulations such as GDPR restrict where data about European residents can physically be stored and processed.
“If you’re a multinational organization, you increasingly have to contend with data privacy and data sovereignty regulations,” Borgman says. “Data in Switzerland that’s created in Switzerland by Swiss people has to be kept in Switzerland. The data that’s created in Germany has to stay in Germany. Data created in France has to stay in France.”
Emerging data and privacy laws and regulations “are essentially forcing a data mesh strategy,” Borgman continues. “It’s no longer possible to take all my data from Germany and France and all these different countries and pull them all together, because that would violate the rules, the laws.”
Starburst has created a product called Stargate that seeks to help companies get value out of their data while abiding by these new regulations. Stargate basically allows users to connect multiple Starburst clusters together, while ensuring that the data about customers never crosses the border.
“Stargate is essentially a Starburst to Starburst connector,” Borgman says. “It could be you’ve got one cluster in AWS East and one cluster in AWS Frankfurt. Or it could be multi-cloud. It could be one cluster in AWS and one cluster in Azure. But regardless of where that data actually lives, that local Starburst cluster is doing the processing and only returning the results that are effectively compliant.”
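In practice, a remote Starburst cluster is exposed to the local one as just another catalog. The fragment below is a hedged sketch of what such a catalog configuration might look like; the property names follow the general pattern of Starburst's Stargate connector, but the file name, host, and credentials are all invented for illustration:

```properties
# Hypothetical catalog file (e.g., etc/catalog/frankfurt.properties) on a
# US-based Starburst cluster, pointing at a remote cluster in AWS Frankfurt.
# Property names are an assumption based on Starburst's Stargate connector.
connector.name=stargate
connection-url=jdbc:trino://starburst-frankfurt.example.com:443/hive
connection-user=federated_user
connection-password=********
```

Queries against this catalog are delegated to the Frankfurt cluster, which does the processing locally and returns only results, so the raw regulated data never leaves its region.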
Starburst wants its Trino-based software to be the query engine for the emerging data mesh. But Borgman says it’s important to recognize that Starburst isn’t a data mesh, in and of itself. “The other components for this kind of model or this kind of design would be governance and access control, so for example companies like Immuta or Privacera,” he says.
Tracking data in the data mesh is important, and that’s where data catalogs from vendors like Collibra and Alation come into play, Borgman says. “And BI tools which aimed to visualize the data across these things,” he adds. “We have particular partnerships with Tableau and ThoughtSpot and [Microsoft’s] PowerBI that we work very closely with to help visualize the data that we can connect to.”
At the end of the day, the data mesh strategy is a compromise, like anything else. But when you abandon the forced centralization of data, you don’t give up on data quality or data governance. Instead, those steps and those disciplines are now simply conducted in a distributed manner, mirroring the natural state of the data itself.
“The reason I find [data mesh] particularly attractive is simply that it reflects to me what reality naturally looks like,” Borgman says. “It is so rare for a customer to truly implement the enterprise data warehouse to its fullest extent. To be able to actually have everything that you need in one place is, practically speaking, very challenging to do. And I think that goes all the way back to the earliest days of Teradata trying to do [that]. And now today Snowflake trying to do that.”
Data today is spread across object stores in the cloud. It’s in S3, Azure Data Lake Storage, and Google Cloud Storage. It’s in MongoDB, Cassandra, and Aerospike NoSQL databases. It’s sitting in Databricks lakehouses and Snowflake warehouses. It’s in Hadoop clusters; SingleStore, CockroachDB, and Yugabyte NewSQL databases; and graph stores from Neo4j, TigerGraph, and Franz. It’s in a myriad of cloud- and on-prem object stores and distributed file systems. It’s on prem in Oracle, Db2, and Postgres relational databases. It’s flowing in Kafka, Pulsar, and other pub-sub systems. It’s in Excel worksheets and Access databases. The data genie is out of the bottle, and it’s never going back in.
“Going back to the Stonebraker quote, there is no one-size-fits-all database system,” Borgman says. “I think that applies still in the cloud era, as well. It’s just a different set of databases.”