IBM to Showcase Open Analytics Push at PrestoCon Day
The PrestoDB community will come together this Wednesday for PrestoCon Day, the third annual virtual event showcasing the popular open source SQL engine. Representatives from Uber, Adobe, Alibaba, and TikTok will share stories about how they use PrestoDB and open analytics in general. One vendor looking to make a splash is IBM, which is the new owner of an enterprise PrestoDB offering and the latest adherent to open lakehouse architectures.
Thanks to its consistently high performance on batch and interactive workloads, ability to scale linearly, and adherence to ANSI SQL standards, PrestoDB has become one of the most popular open query engines available today. The software was originally developed at Facebook as the successor to Apache Hive during the Hadoop heyday, but today PrestoDB is widely used on a broad range of big data repositories, including relational databases, object stores, and distributed file systems.
PrestoDB naturally will be the star attraction at PrestoCon Day, a one-day virtual event that is free to attend. The fun starts at 8:30 a.m. PT, when Presto Foundation members Ali LeClerc and Girish Baliga welcome the community and deliver their opening remarks. More than 20 sessions follow, ranging from case studies on PrestoDB usage at Bytedance and Alibaba Cloud to discussions of the latest Presto features, such as Intel’s contribution to Project Velox, and of how PrestoDB fits into HPE’s Ezmeral lineup.
IBM will also attend the virtual event. While Big Blue is a longtime purveyor of proprietary software and systems, it is now in the midst of a full-scale embrace of open analytics and open platforms, such as PrestoDB and data lakehouse architectures.
IBM is the newest member of the Presto Foundation, the governing body behind PrestoDB. Thanks to its acquisition of PrestoDB vendor Ahana in April, the company gained a seat on the board of the Presto Foundation, which is a part of the Linux Foundation.
Presto fits neatly into IBM’s new lakehouse offering, called Watsonx.data, which it unveiled in May. Lakehouses have grown in popularity as a happy medium between data lakes such as Hadoop, which had a habit of turning into ungoverned but super-scalable data free-for-alls, and data warehouses, which delivered good data reliability and governance but carried extra cost and had limited scalability.
Compared to IBM’s previous forays into big data and Hadoop, the new generation of open analytics technologies, as exemplified by Presto, is much more ready for prime time, says Vikram Murali, who is the vice president of software development for data and AI at IBM.
“I truly believe we are at a point where, when we GA this thing, customers will see that we have solved a lot of these issues,” Murali says. “And that is one of the reasons, by the way, that we chose Presto. We could have chosen to go down the route of creating another proprietary engine. But instead, we wanted to go with something that was available in open source, something that was mature, where companies like Uber and Meta have been using it for years, and they have already solved the scalability [and] the elasticity [issues]. So all of those have become table stakes now, and that’s what we gain by going with Presto.”
IBM plans to make its Watsonx.data lakehouse generally available next month. The plan calls for launching two fully managed Watsonx.data lakehouse offerings: one on AWS that uses S3 storage (basically the pre-existing managed offering from Ahana) and another on IBM Cloud that uses S3-compatible storage from IBM’s Cloud Object Store (COS).
Users can also deploy Watsonx.data in a hybrid manner mixing cloud and on-prem storage, and they can bring multiple query engines to bear on the data stored there, Murali says. “The way we differentiate our lakehouse offering is that we are truly hybrid,” he tells Datanami. “You can deploy it anywhere–on-prem, cloud–but it’s also multi-engine.”
Specifically, Watsonx.data users running in the IBM Cloud will use OpenShift Data Foundation (ODF) as the core object storage system in COS, Murali says. However, users also have the option of running Watsonx.data on-prem if they want, in which case any S3-compatible object store will work, including MinIO or even the old Cleversafe object storage offering, which today is sold as part of COS. The underlying technology for managing these hybrid cloud storage setups is based on NooBaa, a data gateway acquired by Red Hat a few years ago, Murali says.
IBM is supporting PrestoDB as the core analytics engine for the Watsonx.data lakehouse. But it’s not the only engine that IBM will be pushing. When Watsonx.data goes GA next month, users will also see Apache Spark, which will enable users to bring more data engineering and data science-focused workloads into the lakehouse. IBM, of course, has a long history supporting Spark, so this is not a surprise.
But in addition to PrestoDB and Spark, IBM will bring Db2 and Netezza engines into the Watsonx.data lakehouse, Murali says. The plan is for those engines to be ready next month when the cloud lakehouse services become available, he says. Eventually, users will also be able to bring other open analytics engines, such as Dremio, to bear on the data, he says. (IBM did not give a clear answer when asked whether it would also support Trino, the fork of Presto backed by Starburst.)
One of the key pieces of technology that allows so many open source engines to be used on the same Parquet, Avro, or ORC dataset without turning it into an ungoverned digital cesspool is Apache Iceberg. The open table format will help to keep all the data straight in Watsonx.data as multiple customers use multiple query engines to process it, Murali says.
“If they have Dremio or any other engine, they can choose to bring that,” he says. “We hope customers will come through Presto. But any engine they choose, we want them to come through that [Iceberg] metadata layer. That way we know what’s going on and we can maintain consistency across multiple engines.”
Much of the Presto ecosystem has rallied behind Iceberg, which came out of Netflix and Apple. But the more the merrier in this new open world, and there’s always room for another approach, which also applies to table formats. To that end, IBM is actively working to ensure compatibility in Watsonx.data with Apache Hudi, which came out of Uber.
“I think this is why the Presto community shines,” says Girish Baliga, who is director of engineering at Uber and also the governing board chair of the Presto Foundation. “We have people who use it with different formats, and the engine allows us to do that pretty easily.”
There is a lot of momentum behind Iceberg in the Presto community, Baliga says, even though Hudi was already in development at Uber. “But from Uber’s perspective, more is better,” he tells Datanami. “I think putting up common layers that address all formats into the engine itself leads to a better, more open architecture.”
Embracing openness is certainly a strategy for IBM, which still has a large installed base of enterprise customers running Db2 and Netezza data warehouses, not to mention millions of tables of data (and plenty of old flat files) stored on proprietary Power and System Z mainframe systems. While there are no easy buttons when it comes to integrating this long tail of legacy IT systems with modern data stacks, IBM is clearly intent on doing all it can to lower the barrier to entry and get its customers to adopt newer tech, if not as a replacement mechanism then for new data projects.
“One of the main value-adds is how we package all of this together, where it’s easy and up and running probably in a few minutes, instead of the customer becoming the integrator,” Murali says. “Presto by itself is free. You can download it. You can install it. But what we want to help customers with is how easy it is to deploy, make administration of it easy, and fix vulnerabilities. That is something which is very, very key for our enterprise customers making sure that critical Sev One security vulnerabilities, all of those things are fixed and how we package the entire solution.”
You can register for PrestoCon Day here.