Rethinking Operational Analytics on Relational DBs
When it comes to picking a platform for big data analytics, there are as many options as there are types of analytics. Hadoop scales to the moon, but it may be overkill for lighter-weight analytics on production data. Distributed databases may be able to handle these so-called “operational analytic” workloads, but not without sacrificing familiarity and ease of use. In some cases, a parallelized version of a good old relational database, like PostgreSQL or MySQL, may be just the ticket.
Several new classes of technology are battling it out in the race for big data supremacy, which makes this publication’s mission such a daily adventure. Hadoop is clearly the top dog in the big data space, able to scale horizontally to thousands of nodes and handle data sets in excess of 100 PB, providing a veritable wonderland of information for data science wizards to discover amazing insights. Some organizations clearly need this level of big data majesty, but the truth is, many don’t.
Organizations that can’t yet justify Hadoop (or other large-scale analytic platforms built on Spark or Mesos) may look to scale-out databases to handle their operational analytic needs, which is often referred to as analysis on live data. There are NoSQL databases like MongoDB, Cassandra, and Couchbase that have completely rewritten the rules of data storage (adherence to ACID, reliance on SQL, etc.). And then there are distributed relational databases (often dubbed NewSQL) that retain those familiar elements from relational databases, but scale out horizontally like their NoSQL brethren. Clustrix, MemSQL, and VoltDB are good examples of NewSQL databases.
But there’s another class of technology that extends traditional relational databases out into the distributed realm. In December we wrote about how Deep Information Sciences enabled a Croatian lab to continue its genetic research when its data set exceeded the capabilities of a single MySQL server. After it adopted the DeepSQL engine and its “hyper-indexing” capability, the lab was able to perform database joins that exceeded the server’s 200GB of RAM.
Elsewhere, there are outfits enabling true parallelization of relational databases. For example, Citus Data today unveiled a new release of its drop-in technology for PostgreSQL that enables a single PostgreSQL database to run across multiple nodes. The software, called CitusDB 5.0, incorporates the auto-sharding capability that it previously offered in a product called pg_shard, as well as new execution paths designed to simultaneously support both analytic and transactional workloads.
Citus Data CEO and founder Umur Cubukcu explains what customers get when they download CitusDB, which is free. “When you get CitusDB you get PostgreSQL 9.5, the latest release of the database,” he says. “But then you say ‘create extension Citus,’ and that transforms PostgreSQL into a distributed database with multiple executors for different workloads, parallel execution, fault tolerance, auto-sharding, replication, etc.”
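In practice, the transformation Cubukcu describes comes down to a couple of SQL statements. The sketch below is illustrative only: the table and column names are hypothetical, and the exact distribution functions vary by Citus version (in the 5.x era, hash-distributed tables were created via `master_create_distributed_table`, per the Citus documentation of the time):

```sql
-- Turn a stock PostgreSQL 9.5 server into a Citus coordinator node.
CREATE EXTENSION citus;

-- A hypothetical clickstream table of the kind Citus Data targets.
CREATE TABLE page_events (
    user_id    bigint,
    page_url   text,
    event_time timestamptz
);

-- Hash-distribute the table on user_id so that ingest and analytic
-- queries are sharded and parallelized across the worker nodes.
SELECT master_create_distributed_table('page_events', 'user_id', 'hash');
```

Once the table is distributed, ordinary INSERTs and SELECTs against `page_events` are routed to the appropriate shards, which is what lets the same database absorb real-time writes while serving analytic reads.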
Citus Data is targeting a class of user that needs to perform behavioral analytics on fast-moving real-time data, often time-based and event-based data, such as Web and mobile clickstreams, machine data, and sensor data from the Internet of Things. Oftentimes, this type of system powers a dashboard that’s continually updated to ease decision making in the business.
This type of fast-flow data can quickly overwhelm a single PostgreSQL instance, says Craig Kerstiens, the former engineer at PostgreSQL-as-a-service provider Heroku whom Cubukcu recently brought in to head up product.
“We ran into that at Heroku over and over again, where an application would start storing clickstream or event or time-series or IoT and sensor data in a table, and that one table would grow and grow,” Kerstiens says. “As soon as it got to 100GB or bigger, that was a problem for single-node PostgreSQL. At that point, developers had to do something. It usually required a lot of engineers to figure out a solution, maybe another system or manual sharding. There’s a number of options, but really no perfect ones.”
By using a CitusDB-powered distributed PostgreSQL database, customers can continue to collect data in an operational system and perform analytics upon it at the same time. It’s not replacing a Hadoop data lake or an MPP data warehouse, which many CitusDB customers still use for big batch-oriented ETL, SQL, or machine learning workloads. Instead, it’s filling a gap, which is the need for real-time insight on fast-moving data, Cubukcu says.
“If you have 1TB of data or hundreds of gigabytes that don’t fit into the memory of a single box, you’re in difficult territory,” he says. “I look at CitusDB as more of a real-time data warehouse, if you will. Some of our folks use it in front of [Amazon] Redshift or [Google] BigQuery. If you have 100GB of data in BigQuery, that’s still a lot of data to serve from a single machine.”
The software is rarely used on clusters with more than a few dozen nodes, according to Cubukcu. But that’s enough to handle data sets measuring into the hundreds of terabytes. And because the PostgreSQL innards are bulletproof, Cubukcu says he doesn’t have to worry about concurrency. The real sweet spot, he says, is “real-time ingest with sub-second read queries and a lot of concurrency. Not just five or 50 products or analysts looking at the data, but thousands of requests hitting the database at the same time.”
Citus Data is giving away CitusDB as a free and open source product. There is also an enterprise edition that includes enterprise support, administrative tools for balancing and scaling PostgreSQL instances, and access controls for improved security. It also includes the Citus Real-Time Behavior Analytics library, a set of C-based behavioral and funnel analytics functions.
At the end of the day, customers have a lot of options for big data analytics. But as Cubukcu sees it, customers don’t have to trade the comfort of traditional relational systems like PostgreSQL or MySQL for more exotic tech like NoSQL and Hadoop.
“In a way, distributed systems have regressed a lot in terms of how simply these things can be done,” he says. “You may still have Hadoop or [a] separate data warehouse…but when things need to happen in real time, you need a simpler framework. That’s where Citus fits in.”