Transforming PostgreSQL into a Distributed, Scale-Out Database
PostgreSQL users who were considering adopting a distributed NoSQL database like MongoDB or Cassandra to gain scalability benefits for big data may want to think twice about that approach following today’s launch of new software that allows PostgreSQL to scale out horizontally, just like the NoSQL databases do.
PostgreSQL is one of the most popular relational databases, with millions of implementations over its 28-year history. Backers of the open source software have kept it relevant in the big data era by adding support for new data types, such as JSON, as well as a key-value store extension for speeding quick data lookups. And the folks at EnterpriseDB, the commercial entity behind PostgreSQL, have not been shy about touting the database’s capability to handle schema-less data simultaneously with traditional relational data (and doing it better than NoSQL, they claim).
But one trick that PostgreSQL hasn’t learned from its NoSQL and NewSQL cousins has been the capability to scale out horizontally. PostgreSQL, like most traditional relational databases designed to power transactional workloads, has always scaled up in a vertical manner, rather than scaling out in a horizontal manner. Need more CPU and memory to power a big application? Then buy a bigger server.
That approach may be changing as a result of the new pg_shard add-on from Citus Data. The San Francisco, California company has been building solutions based on PostgreSQL for years, most notably its analytic-oriented CitusDB database, as well as the cryptically named extension “cstore_fdw” that adds columnar database structure onto PostgreSQL.
The new pg_shard offering that Citus unveiled today builds on that experience by effectively transforming a single PostgreSQL instance into a distributed parallel database that can run on any number of nodes for the purpose of driving higher scalability, I/O, and availability. It’s a similar approach that CodeFutures took earlier this year with its tool for sharding and breaking up MySQL into a distributed database.
Citus’ sharding add-on basically turns PostgreSQL into a Hadoop-like system, says Umur Cubukcu, CEO and founder of Citus Data. “We handle automatic replication of data and automatic handling of failures, and elastic scaling up and down,” he tells Datanami. “But instead of doing it at the file system, we’re starting at the database as our core.”
Not a Magic Box
There are certain tradeoffs to this approach. “We don’t do magic, of course. but we make reasonable tradeoffs,” Cubukcu says. “We don’t take on everything that a database does.”
At the top of that list is ACID compliance. If you’re running a banking application on PostgreSQL and rely on the database for ensuring two-phase commits, it would not be wise to transform that to a distributed architecture that cannot guarantee transactional consistency.
But for certain types of applications that customers have attempted to run on PostgreSQL, the distributed path may be a better choice. Cubukcu sees distributed PostgeSQL providing a good alternative when the scale-up PostgreSQL approach runs out of storage, I/O, or computational headroom. In these cases, sharding PostgreSQL data to additional nodes can boost the overall capacity of the database to store and process data, and possibly slow or entirely eliminate the need to migrate the data and apps to a NoSQL solution.
“If you’ve got Postgres or MySQL or Oracle and you need to scale out, one option is to migrate to a Couchbase to tackle scale,” Cubukcu says. “But that requires remodeling the data, a new technology stack, and maybe hiring people with that skill set, versus just scale out your existing database.”
The pg_shard add-on doesn’t require any modifications be made to the PostgreSQL datagbase. All existing apps and data will be available just as they were before. The software will work with PostgreSQL version 9.3 and the upcoming 9.4 release, the company says.
While enabling existing PostgreSQL applications to scale out in a distributed manner is one promising for pg_shard, Cubukcu sees more potential in using the software as the basis for running SQL-based analytic applications (hence the Hadoop comparisons). The analytic capabilities of most NoSQL database are “sub-optimal by any measure,” Cubukcu says. “The Achilles Heel of those solutions is the analytics,” he says.
“If you want to run these ad hoc analytics on your large distributed dataset, in the world of [NoSQL], you would basically be running on MapReduce or Spark,” he adds. “But that’s where the relational model works real well. If I’m asking questions across users, such as ‘What percentage has done this action over this timeframe?’ your traditional real-time analytic questions lend themselves really well to SQL and SQL-based analytics.”
The combination of the increased I/O that a distributed PostgreSQL database can bring and the continuation of the SQL semantics will provide a powerful one-two punch for analyzing machine-generated data, clickstream data, and log data, Cubukcu says. “We are making it easy and simple to bring big data together with fast data on PostgreSQL,” he says.
Citus is releasing the pg_shard add-on a free and open piece of software. The company’s main money-maker, CitusDB, already offers some of the horizontal capabilities, but now this functionality is available to all PostgreSQL customers free of charge.