Follow Datanami:
November 16, 2017

Distributed PostgreSQL Settling Into Cloud

Organizations that want the scalability of a distributed PostgreSQL database but don’t want the hassle of managing it themselves may be interested in the latest news from Citus Data, which today unveiled new options for its hosted, scale-out relational database.

Citus Data broke onto the scene a couple of years ago with an extension to PostgreSQL that transforms the popular relational database into a distributed database. Called Citus (it was originally called CitusDB), the open source software enables PostgreSQL to run on multiple physical nodes, complete with multiple executors, parallel execution, fault tolerance, and automatic sharding.

In short, Citus gives you all the horizontal scalability you’d expect to find in a modern NoSQL database, but without giving up the transactional consistency and SQL familiarity you’d find in a relational database. “So you can kind of have your cake and eat it too,” says Citus Data’s Head of Cloud, Craig Kerstiens.

The product has been well-received and is being downloaded thousands of times per month and used across industries, from finance and ad tech to Web analytics and cybersecurity, according to Citus Data. The largest Citus cluster is now approaching 1PB of data across a 2,000 core cluster with about 10TB of memory. That’s a huge amount of data to house in a single PostgreSQL database. Those are Hadoop-like numbers, really, but the difference is that all the data in the Citus cluster is online and ready to be transacted upon by an application that thinks it’s talking to a plain vanilla PostgreSQL database running on a single machine.

About a year ago, Citus Data unveiled a hosted cloud option for Citus running on Amazon Web Services. This service has been very well-received by customers because it’s eliminated much of the day-to-day management associated with running a parallelized PostgreSQL cluster, says Citus Data CEO and founder Umur Cubukcu.

Managing a sharded database isn’t easy, and requires a certain degree of skill and experience. “Databases are hard. Distributed databases are harder,” Cubukcu tells Datanami. “What happens when I lose a machine? How do roll back, take backups, add new nodes, or reshard my data?”

If an organization has to assign their best engineer to managing the distributed cluster, they can plant a firm “win” in the distributed database column. But they also must recognize the opportunity costs associated with that move, Cubukcu says. With its hosted Citus offering, the company tells customers to leave the management of the distributed database to them, thereby freeing up their top engineers to work on developing cutting edge applications.

With today’s announcement, Citus Data is making it easier for customers to get going with its hosted distributed database. Among the new features is Citus Warp, which lets customers migrate from PostgreSQL to Citus with the push of a button, the company claims. “There’s no downtime,” Kerstiens says. “You replicate the data live, then schedule a cutover, which is a matter of milliseconds.”

Other new features the San Francisco-based company has added includes:

  • a zero-downtime shard rebalancer, which the company claims enables customers to elastically scale out memory, compute and storage as they add nodes;
  • a point-in-time recovery that allows users to roll back their database to any point in the last 10 days to recover lost data;
  • the capability to offload read-only queries to a replicated copy of the database through the new “followers” feature;
  • a new “fork” function that lets customers experiment or run complex queries with a full copy of their database;
  • support for distributed ACID transactions;
  • and multi-tenant support for Rails and Django frameworks.

The fully managed Citus Cloud offering is currently restricted to AWS, although there’s nothing stopping customers from running the database themselves on any cloud platform. According to Kerstiens, it can scale up to 100 nodes in the cloud with access to 240 GB of RAM per node.

“That’s a very very large cluster, a scale you can [typically] only get with on-premise hardware,” he says.

Related Items:

Rethinking Operational Analytics on Relational DBs

Cloud In, Hadoop Out as Hot Repository for Big Data