May 10, 2023

IBM Embraces Iceberg, Presto in New Watsonx Data Lakehouse

(Francesco Scatena/Shutterstock)

IBM yesterday unveiled watsonx.data, a new data lakehouse offering for cloud and on-prem that will use object storage and Apache Iceberg, an open table format. Big Blue also launched two other offerings in the new watsonx family, watsonx.ai and watsonx.governance, yesterday at its annual THINK conference. Together, the three watsonx components represent IBM’s latest push into the enterprise AI market.

Lakehouses have proliferated in recent years as companies look to combine the massive scalability of cloud-based object storage with the proven data management and governance capabilities of traditional data warehouses running on analytics databases. The lakehouse is designed to bring order to data, avoiding ungovernable data swamps, but without the storage limitations posed by data warehouses.

When it becomes generally available in July, IBM’s new watsonx.data lakehouse will run on-prem, in the IBM Cloud, and on AWS. While IBM didn’t specify this in its announcement, the offering is presumed to use IBM’s own flavor of object storage, which it obtained with its 2015 acquisition of Cleversafe for $1.5 billion.

Watsonx.data will also incorporate Apache Iceberg, the increasingly popular open table format that emerged from Netflix and Apple to address data consistency and correctness issues that arose with the reliance on Apache Hive in the early days of Hadoop-based data lakes. By bringing support for ACID transactions to data, Iceberg enables customers to bring multiple compute engines to bear on data residing in a lake or lakehouse.

To that end, IBM foresees Presto and Apache Spark being two of the first data engines to run in its watsonx.data lakehouse. IBM has been a big supporter of Spark for years, both in terms of running it on behalf of customers and making upstream code changes to the project.

But IBM also has a sizable investment in Presto, the distributed query engine that came out of Facebook last decade as the replacement for Apache Hive (which Facebook also created). With its capability to read data from multiple data stores and serve up fast ad-hoc queries, Presto has emerged as one of the leading processing engines for the modern data stack.

IBM moved into the Presto business last month with its acquisition of Ahana, a Silicon Valley startup building a Presto-based business in the cloud. Ahana had raised $32 million and competes with Trino-backer Starburst (Trino is a fork of Presto) and Amazon Athena, the serverless AWS analytics service that uses Presto and Trino.

IBM says that, in the future, watsonx.data will incorporate its Storage Fusion technology “to enhance data caching across remote sources as well as semantic automation capabilities built on IBM Research’s foundation models to automate data discovery, exploration, and enrichment through conversational user experiences.”

Watsonx.data will feature built-in governance capabilities for data housed in the lake. The company also launched watsonx.governance to help provide guardrails and transparency for AI and machine learning models developed in watsonx.ai, which is another new offering unveiled by IBM. Specifically, IBM says watsonx.governance will “provide the mechanisms to protect customer privacy, proactively detect model bias and drift, and help organizations meet their ethics standards.”

Watsonx.ai, meanwhile, will function as a new development studio for building AI applications. The offering will include a library of “foundation models” upon which customers can build AI applications. In addition to language models, the library will include models designed to work with code, time-series data, tabular data, geospatial data, and IT events data, IBM says.

Among the models that will be included in watsonx.ai are: fm.code, which automatically generates code for developers through a natural-language interface; fm.NLP, a collection of large language models (LLMs) for specific and industry-specific domains; and fm.geospatial, a model built on climate and remote sensing data to help organizations understand and plan for changes in natural disaster patterns, biodiversity, land use, and other geophysical processes, IBM says. IBM will also incorporate into watsonx.ai thousands of natural language processing (NLP) models developed by Hugging Face.

The new watsonx line of offerings will give customers the tools they need for building next-gen AI models while retaining governance and control, says Arvind Krishna, IBM chairman and CEO.

“With the development of foundation models, AI for business is more powerful than ever,” Krishna said in a press release. “Foundation models make deploying AI significantly more scalable, affordable, and efficient. We built IBM watsonx for the needs of enterprises, so that clients can be more than just users, they can become AI advantaged. With IBM watsonx, clients can quickly train and deploy custom AI capabilities across their entire business, all while retaining full control of their data.”

Related Items:

IBM Joins the Presto Foundation with Acquisition of Ahana

Open Table Formats Square Off in Lakehouse Data Smackdown

Snowflake, AWS Warm Up to Apache Iceberg

Editor’s note: This article has been corrected. The headline was changed to reflect IBM’s focus on Presto, not Trino. Datanami regrets the error.
