April 22, 2024

EDB Puts Postgres in the Middle of Analytics Workflow with New Lakehouse Stack


Next month, EnterpriseDB is expected to formally launch a new lakehouse that puts Postgres at the center of analytics workflows, with an eye toward future AI workflows. Currently codenamed Project Beacon, EDB’s new data lakehouse stack will use object storage, an open table format, and query accelerators to let customers query data through their standard Postgres interface, but in a highly scalable and performant manner.

The popularity of Postgres has skyrocketed in recent years as organizations have widely adopted the open source database for new applications, especially those running in the cloud. The database’s proven scale-up performance, historical stability, and adherence to ANSI standards have allowed it to become, in effect, the default relational database for running online transaction processing (OLTP) workloads.

While Postgres’ fortunes have soared on the transactional side of the ledger, it hasn’t found nearly as much success when it comes to online analytical processing (OLAP) workloads. Organizations will typically do one of two things when they want to run analytical queries against data they have stored in Postgres: just deal with the meager analytical capabilities of the relational row store, or ETL (extract, transform, and load) the data into a purpose-built relational database that scales out and features columnar storage, which better supports OLAP-style aggregations.
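The row-versus-column tradeoff is easy to see with Apache Arrow’s Python bindings (the same in-memory columnar format that turns up later in EDB’s stack). A minimal sketch, with toy table and column names:

```python
# Minimal illustration of why columnar layouts favor OLAP-style
# aggregations: each column is a contiguous array, so a scan touches
# only the columns the query needs. Names and values are toy placeholders.
import pyarrow as pa

# An Arrow table stores "region" and "amount" as separate column arrays.
orders = pa.table({
    "region": ["east", "west", "east", "west"],
    "amount": [120.0, 75.5, 210.0, 42.0],
})

# A GROUP BY aggregation reads the two relevant columns and nothing else;
# a row store would have to walk every full row to answer the same query.
by_region = orders.group_by("region").aggregate([("amount", "sum")])
print(by_region.to_pydict())
# e.g. {'region': ['east', 'west'], 'amount_sum': [330.0, 117.5]}
```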

Developing ETL data pipelines is difficult and adds complexity to the technology stack, but there hasn’t been a better solution to the data-movement problem for more than 40 years. The advent of specialty NoSQL data stores last decade, and the current craze around vector databases for generative AI use cases, have only exacerbated the complexity of big data movement.


The folks at EDB are now taking a crack at the problem. About a year ago, the Postgres backer began an R&D effort to create a scale-out version of Postgres, which would put it into competition with Postgres-based databases from companies like Yugabyte, Cockroach Labs, and Citus Data, which was acquired by Microsoft in 2019.

The company was nine months into that effort before hitting the pause button, said EDB Chief Product Engineering Officer Jozef de Vries. While the company may restart that work, it sees more promise in Project Beacon, which is currently being tested by early adopters.

“We’re really trying to capitalize on the popularity and standardization of the Postgres interface and the experience that Postgres provides, but decoupling the performance and data-scale issues from the Postgres core architecture itself,” de Vries said.

As it currently stands, Project Beacon is composed of AWS’s Amazon S3, Databricks’ Delta Lake table format (with Apache Iceberg support coming in the near future), the Apache Arrow in-memory columnar format, and Apache DataFusion, a fast, Rust-based SQL query engine designed to work with data stored in Arrow.
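To get a feel for the DataFusion layer on its own, here is a minimal sketch using the engine’s Python bindings. The Parquet file name is hypothetical, and this queries DataFusion directly rather than through the Postgres front end that Project Beacon adds:

```python
# Hedged sketch: querying columnar data directly with Apache DataFusion's
# Python bindings (pip install datafusion). In Project Beacon this layer
# sits behind Postgres; the file path here is hypothetical.
from datafusion import SessionContext

ctx = SessionContext()

# Register a Parquet file (columnar, as Delta Lake tables store their
# data) as a SQL-queryable table.
ctx.register_parquet("orders", "./orders.parquet")

# DataFusion plans and executes the SQL over Arrow record batches.
df = ctx.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
df.show()
```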

De Vries explained how it will all work:

“Postgres is the query interface. So they’re not directly querying with DataFusion. They’re not directly querying against S3. They’re querying against their Postgres interface, and those queries are executed through those systems behind the scenes,” he said. “So the object storage allows for greater volumes of data and also enables that data to be stored in a columnar format through the Delta Lake or Iceberg, and DataFusion is what allows the execution of the SQL queries against that data stored in the object storage.”
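In other words, the developer-facing code is ordinary Postgres client code. A hedged sketch with psycopg2, where the endpoint, credentials, and table are all hypothetical:

```python
# Hedged sketch of the user experience de Vries describes: the client
# talks only to Postgres, which dispatches analytical queries to the
# accelerator layer behind the scenes. Connection details are invented.
import psycopg2

conn = psycopg2.connect(
    host="beacon.example.com",  # hypothetical endpoint
    dbname="appdb", user="app", password="secret",
)
with conn, conn.cursor() as cur:
    # Ordinary Postgres SQL; the lakehouse machinery is invisible here.
    cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    for row in cur.fetchall():
        print(row)
conn.close()
```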

Data is replicated automatically from a customer’s Postgres database into S3, eliminating the need to deal with ETL pipelines, de Vries said. Customers will get the capability to query very large amounts of their Postgres data in near real-time with performance that Postgres itself is incapable of delivering.
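That automatic replication stands in for the kind of hand-rolled pipeline teams would otherwise build. For contrast, a sketch of the manual pattern, assuming pandas, SQLAlchemy, and pyarrow, with all connection details and names invented for illustration:

```python
# Hedged sketch of the manual ETL step Project Beacon's automatic
# replication replaces: periodically dump a Postgres table to columnar
# Parquet on S3. Bucket, table, and connection string are hypothetical.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
from sqlalchemy import create_engine

engine = create_engine("postgresql://app:secret@db.example.com/appdb")

# Pull the rows out of the row store...
frame = pd.read_sql("SELECT * FROM orders", engine)

# ...and land them as Parquet in object storage for analytical engines.
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(
    pa.Table.from_pandas(frame),
    "my-bucket/lake/orders/orders.parquet",
    filesystem=s3,
)
```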

“We want to go after those users that need to get more insights into that transactional data or operational data itself…and bring those capabilities closer in hand as opposed to offloading it onto third-party systems,” he told Datanami. “We’re abstracting away those underlying technologies–object storage, the storage formatting, DataFusion, those sort of things–so that users really only have to continue to interact with Postgres.”

Simplifying the tech stack not only makes life easier for application developers, who no longer have to maintain “slow-running, high overhead ETL systems and a separate data warehouse system,” de Vries said, but also provides faster time-to-insight by eliminating the lag of nightly batch ETL loads into the warehouse.

The company rolled out the product, which does not yet have a formal name and is still referred to as Project Beacon, in the middle of March. It plans to announce the general availability of the new stack in late May.

There are additional development plans around Project Beacon. The company is also looking to provide a unified interface, or a “single pane of glass,” to monitor and manage all of a customer’s Postgres databases, including EDB’s managed cloud databases like BigAnimal, other cloud and on-prem Postgres interfaces, and even third-party managed Postgres offerings like AWS’s Amazon RDS and Microsoft’s Flex Server.

The widespread adoption of Postgres has become an issue for some customers, de Vries said. “They’ve got database systems running all over the place,” he said. “It’s really complicated the lives of the DBA and IT and InfoSec teams, since they can’t really account for these data systems that are getting spun up.”


The company also plans to eventually merge the Project Beacon lakehouse with Postgres databases into a single cluster, a la the hybrid transactional-analytical processing (HTAP) convergence. “We want to work towards a more HTAP-type experience where you can run transactional and analytical processing through the same instance,” he said.

“We still have some design and solutioning to do here,” he continued, “but for this system, it would detect whether those are analytically shaped queries or transactional shaped queries, and when they’re analytically shaped queries, to offload it to this analytical accelerator system that we’re building out. It simplifies…and gets the user closer to that near real-time analytical capability and keep them truly in the same clustered environment.”
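EDB has not described how that query detection will work. Purely for illustration, the general shape of such a router can be mocked up with a crude keyword heuristic; none of this reflects EDB’s actual design:

```python
# Illustrative-only sketch of HTAP-style query routing: classify a query
# as "analytically shaped" or "transactionally shaped" and pick a backend.
# This crude keyword heuristic stands in for whatever planner-level
# detection EDB ultimately builds; it is not their design.
import re

ANALYTIC_HINTS = re.compile(
    r"\bGROUP\s+BY\b|\bHAVING\b|\bOVER\b|\b(SUM|AVG|COUNT|MIN|MAX)\s*\(",
    re.IGNORECASE,
)

def route(query: str) -> str:
    """Return which engine should execute the query."""
    if ANALYTIC_HINTS.search(query):
        return "analytical-accelerator"  # e.g. DataFusion over object storage
    return "postgres-oltp"               # the ordinary transactional path

print(route("SELECT * FROM orders WHERE id = 42"))
# -> postgres-oltp
print(route("SELECT region, SUM(amount) FROM orders GROUP BY region"))
# -> analytical-accelerator
```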

Eventually, the plan calls for bringing additional capabilities, such as vector embeddings, vector search, and retrieval-augmented generation (RAG) workflows, into the EDB realm to make it easier to build AI and generative AI applications.
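EDB has not named the components it will use for that. Within the Postgres ecosystem, though, the pgvector extension is a common building block for vector search, so a hedged sketch of the idea (not EDB’s confirmed approach; the table, values, and tiny 3-dimension embeddings are toys):

```python
# Hedged sketch of vector search inside Postgres via the pgvector
# extension, one common Postgres-ecosystem path to the RAG workflows the
# article mentions. EDB hasn't confirmed this is their approach.
import psycopg2

conn = psycopg2.connect("postgresql://app:secret@db.example.com/appdb")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # needs privileges
    cur.execute(
        "CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, "
        "body text, embedding vector(3))"
    )
    cur.execute(
        "INSERT INTO docs (body, embedding) VALUES (%s, %s)",
        ("hello lakehouse", "[0.1, 0.2, 0.3]"),
    )
    # Nearest-neighbor retrieval by L2 distance (pgvector's <-> operator),
    # the core lookup step of a retrieval-augmented generation flow.
    cur.execute(
        "SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 1",
        ("[0.1, 0.2, 0.25]",),
    )
    print(cur.fetchone())
conn.close()
```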

At the end of the day, it’s all about helping customers build analytics and AI solutions while keeping more of that work within the Postgres ecosystem, de Vries said.

“Developers love Postgres. They’re investing more into it. Every company we go into is using Postgres somewhere,” he said. “And these companies, particularly in the case of AI, are now trying to find other solutions to enable that AI application development. So can we keep it in the Postgres ecosystem, and then build on that to enable that AI application development?”

Related Items:

EnterpriseDB Bullish on Postgres’ 2024 Potential

Postgres Rolls Into 2024 with Massive Momentum. Can It Keep It Up?

Does Big Data Still Need Stacks?
