Presto Is the Future of Open Data Analytics, Foundation Says
The openness of Presto, its adherence to standard SQL, and the ubiquity and performance of modern cloud storage have combined to put Presto in the driver’s seat of the big data analytics stack for the foreseeable future, leaders of the Presto Foundation say.
In the battle for advanced analytics workloads, Presto is siding with data lakes, those expansive storage repositories built atop distributed file systems and object stores. Since Presto is just a distributed query engine, it must be paired with a third-party storage platform. Many Presto users run it atop cloud object stores like S3, while others, including Facebook (which developed Presto as the follow-on to Apache Hive), run Presto atop modified Hadoop clusters.
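In practice, that pairing is expressed through a catalog configuration file that points the engine at external storage. The fragment below is a minimal sketch of a Presto Hive-connector catalog for data on S3; the host name and credentials are placeholders, and the exact property set depends on your metastore and deployment.

```
# etc/catalog/hive.properties -- hypothetical values, adjust for your setup
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
```

Because the connector only reads metadata and files in place, swapping the storage backend (S3, GCS, HDFS) is a configuration change rather than a data migration.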
On the other side of the advanced analytics battleline are the dedicated data warehouses that join compute and storage together. Vendors like Snowflake and Teradata, and purveyors of cloud data warehouses like Google Cloud and Amazon Web Services, say maximum analytics performance can be had by utilizing proprietary storage formats (although many of them, especially the cloud platforms, are promoting a mix of storage types and analytic workload engines).
In a recent interview with Datanami, Presto Foundation co-chairs Dipti Borkar and Girish Baliga shared their thoughts on what’s driving the growth of the Presto engine and the broader Presto community.
“Presto’s philosophy has been to be the best engine for the data lake,” says Borkar, who is also the co-founder of Ahana, which hosts Presto on the cloud for customers. “We are the heart of the open data lake stack. You can consider it an open source data warehouse. Internally at Uber and Facebook, they actually call it the open source data warehouse. That’s Presto.”
Baliga, whose day job is leading the Presto team at Uber, says that while dedicated data warehouses that pair compute and storage together will always hold a performance advantage over analytics systems that separate compute and storage, the costs of that approach are becoming untenable with today’s massive data volumes.
“There is a tradeoff both ways,” Baliga says. “Yes, they can make some things faster. But you ask yourself: Is that the thing you need? Are you willing to pay the price for that tradeoff? Everything in systems is a tradeoff. So that’s the tradeoff, in my mind.”
There are data management costs involved with using a dedicated data warehouse, according to Baliga. Unless an organization is storing all of its data in a warehouse (which is highly unlikely), it will utilize another storage medium (probably a data lake) to house the bulk of its data, and then use an ETL process to move the data into the warehouse for analysis.
“Every time you have an extra copy of data, you are paying the price for it,” Baliga says. “You have to maintain consistency. There is a delay, a lateness factor in the data–all of these things, which do not exist if you have a single copy of data in your data lake.”
The I/O characteristics of modern data lakes have improved in recent years to the point where they are on par with dedicated analytics warehouses from a few years ago, Baliga says.
“Storage technologies have evolved, so today if you put data on S3 or Google Cloud Storage, you will get I/O performance comparable to dedicated stores from a few years back,” says Baliga, who previously worked on in-house analytic engines at Google. “Yes, it is slower [than dedicated stores], but it is less expensive, and things will always get better. So there will be a point at which the performance is probably not worth the cost.”
As object stores improve their efficiency and capability, so will the dedicated data warehouses, which means the performance advantage of dedicated warehouses over disaggregated stacks will likely persist into the future. We saw this play out with Snowflake, which recently rolled out a data compression upgrade that resulted in a 30% increase in storage efficiency for all customers, across the board. That improvement will save Snowflake customers a combined $14 million over the course of the year, the company recently told Datanami.
At the end of the day, customers are looking for the right price-performance, Borkar says. “In a disaggregated stack, you will never be as close to the performance of the tightly coupled stack,” she says. “But that’s good enough at one-third the price. [That] is why Presto has become so popular.”
The Presto Foundation was founded in 2019 with four members: Facebook, Uber, Twitter, and Alibaba. Over time, the group has added additional members, including Ahana, Alluxio, Upsolver, Intel, and Starburst.
Intel is interested in participating in the Presto Foundation to help bolster the performance of the SQL engine on industry-standard x64 processors. Starburst, meanwhile, is participating with The Linux Foundation-backed group even as it concentrates on developing the fork of Presto called Trino (formerly PrestoSQL). The name change has helped reduce confusion in the market, Borkar and Baliga say.
Another big advantage of Presto is its openness, although it’s harder to put a price tag on its value. Instead of storing data in a proprietary format, as most column-oriented analytics databases do, Presto users can leave their data on the data lake in an open format, such as Parquet or ORC, two of the most popular open source, column-oriented data formats to come out of the Hadoop era.
Presto users can run multiple different data processing engines on top of their ORC and Parquet data sets, which means organizations can use frameworks like TensorFlow to build machine learning models atop the same data sets that Presto is accessing, Borkar says.
“This flexibility of not being locked in, of an open format, of having the flexibility of different types of processing on the same data, without the need to transform it, is why we believe this is the next 10 to 20 years of analytics,” Borkar says. “Presto will be the heart of this stack from a SQL perspective, and then there will be machine learning workloads. There will be virtualization workloads. There will be other workloads that run on top.”
In a way, Presto is carrying forward the open source torch that Hadoop once carried. Many organizations bought into Hadoop’s promise of an open data lake where many different computing engines could work upon the same data sets. However, the reality of running a Hadoop cluster, with its high level of technical complexity and software compatibility issues, ultimately stymied that vision.
What has emerged since the Hadoop era ended – disaggregated compute and storage stacks running on public clouds, often using Kubernetes as the workload manager – carries many of the same benefits that were initially harnessed to Hadoop, but without exposing as much technical complexity to the end users.
“The gamechanger here was the transition to cloud,” Baliga says. “Cloudera, Horton, all those guys–they were primarily focused on on-prem deployments. That became very complex when you have a web of technologies and don’t have control over how your system is used and who’s using it. And the customer also had to have dedicated teams to set up and manage all these deployments.”
Now that customers have transitioned to the cloud, the technical complexity has been reduced, along with the cost of deployment and operations, he says.
Editor’s note: This story has been corrected. Girish Baliga was not previously employed at Facebook. Datanami regrets the error. The story was also updated to reflect the fact that, while Trino is a fork of Presto, it is not Starburst’s fork.