December 4, 2014

Transforming PostgreSQL into a Distributed, Scale-Out Database

Alex Woodie

PostgreSQL users who were considering adopting a distributed NoSQL database like MongoDB or Cassandra to gain scalability benefits for big data may want to think twice about that approach following today’s launch of new software that allows PostgreSQL to scale out horizontally, just like the NoSQL databases do.

PostgreSQL is one of the most popular relational databases, with millions of implementations over its 28-year history. Backers of the open source software have kept it relevant in the big data era by adding support for new data types, such as JSON, as well as a key-value store extension for speeding quick data lookups. And the folks at EnterpriseDB, the commercial entity behind PostgreSQL, have not been shy about touting the database’s capability to handle schema-less data simultaneously with traditional relational data (and doing it better than NoSQL, they claim).

But one trick that PostgreSQL hasn’t learned from its NoSQL and NewSQL cousins has been the capability to scale out horizontally. PostgreSQL, like most traditional relational databases designed to power transactional workloads, has always scaled up in a vertical manner, rather than scaling out in a horizontal manner. Need more CPU and memory to power a big application? Then buy a bigger server.

That approach may be changing as a result of the new pg_shard add-on from Citus Data. The San Francisco, California company has been building solutions based on PostgreSQL for years, most notably its analytic-oriented CitusDB database, as well as the cryptically named extension “cstore_fdw” that adds columnar database structure onto PostgreSQL.

The new pg_shard offering that Citus unveiled today builds on that experience by effectively transforming a single PostgreSQL instance into a distributed parallel database that can run on any number of nodes for the purpose of driving higher scalability, I/O, and availability. It’s a similar approach that CodeFutures took earlier this year with its tool for sharding and breaking up MySQL into a distributed database.

Citus’ sharding add-on basically turns PostgreSQL into a Hadoop-like system, says Umur Cubukcu, CEO and founder of Citus Data. “We handle automatic replication of data and automatic handling of failures, and elastic scaling up and down,” he tells Datanami. “But instead of doing it at the file system, we’re starting at the database as our core.”

Not a Magic Box

There are certain tradeoffs to this approach. “We don’t do magic, of course. but we make reasonable tradeoffs,” Cubukcu says. “We don’t take on everything that a database does.”

At the top of that list is ACID compliance. If you’re running a banking application on PostgreSQL and rely on the database for ensuring two-phase commits, it would not be wise to transform that to a distributed architecture that cannot guarantee transactional consistency.

But for certain types of applications that customers have attempted to run on PostgreSQL, the distributed path may be a better choice. Cubukcu sees distributed PostgeSQL providing a good alternative when the scale-up PostgreSQL approach runs out of storage, I/O, or computational headroom. In these cases, sharding PostgreSQL data to additional nodes can boost the overall capacity of the database to store and process data, and possibly slow or entirely eliminate the need to migrate the data and apps to a NoSQL solution.

“If you’ve got Postgres or MySQL or Oracle and you need to scale out, one option is to migrate to a Couchbase to tackle scale,” Cubukcu says. “But that requires remodeling the data, a new technology stack, and maybe hiring people with that skill set, versus just scale out your existing database.”

Citus Data CEO and cofounder Umur Cubukcu

The pg_shard add-on doesn’t require any modifications be made to the PostgreSQL datagbase. All existing apps and data will be available just as they were before. The software will work with PostgreSQL version 9.3 and the upcoming 9.4 release, the company says.

While enabling existing PostgreSQL applications to scale out in a distributed manner is one promising for pg_shard, Cubukcu sees more potential in using the software as the basis for running SQL-based analytic applications (hence the Hadoop comparisons). The analytic capabilities of most NoSQL database are “sub-optimal by any measure,” Cubukcu says. “The Achilles Heel of those solutions is the analytics,” he says.

“If you want to run these ad hoc analytics on your large distributed dataset, in the world of [NoSQL], you would basically be running on MapReduce or Spark,” he adds. “But that’s where the relational model works real well. If I’m asking questions across users, such as ‘What percentage has done this action over this timeframe?’ your traditional real-time analytic questions lend themselves really well to SQL and SQL-based analytics.”

The combination of the increased I/O that a distributed PostgreSQL database can bring and the continuation of the SQL semantics will provide a powerful one-two punch for analyzing machine-generated data, clickstream data, and log data, Cubukcu says. “We are making it easy and simple to bring big data together with fast data on PostgreSQL,” he says.

Citus is releasing the pg_shard add-on a free and open piece of software. The company’s main money-maker, CitusDB, already offers some of the horizontal capabilities, but now this functionality is available to all PostgreSQL customers free of charge.

Go Shard Thyself, CodeFutures Says

RDBMs: The Hot New Technology of 2014?

Applications: Complex Event Processing

Technologies: Middleware

Sectors: Financial Services, Retail

Vendors: Citus Data, Couchbase, MongoDB

Tags: Hadoop, postgresql, sharding

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Transforming PostgreSQL into a Distributed, Scale-Out Database

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 26, 2024

April 25, 2024

April 24, 2024

April 23, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Top 6 Strategies for Reducing Data Warehouse Costs

Building an Operational Data Warehouse for Real-time Analytics

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

AI & Big Data Expo North America 2024

CDAO Canada Public Sector 2024

AI Hardware & Edge AI Summit Europe

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Transforming PostgreSQL into a Distributed, Scale-Out Database

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 26, 2024

April 25, 2024

April 24, 2024

April 23, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link