The New Math Driving NoSQL Analytics
NoSQL databases are extremely popular among developers thanks to their flexible schemas and rich data types like JSON. But those same attributes make getting data out of them using traditional SQL queries a real pain. Now vendors like SlamData are helping by extending the algebra behind SQL into a dialect NoSQL databases can understand.
Jeff Carr and John De Goes founded SlamData about 3.5 years ago with a relatively simple goal: make it easier to do analytics on NoSQL databases like MongoDB. While these databases were proving to be extremely popular among developers, they were causing problems among business analysts, who found that traditional BI and visualization tools would not work with them directly.
Since Tableau and Qlik and other SQL-powered BI tools can’t directly read JSON data, companies will typically resort to building elaborate ETL scripts that extract data from MongoDB or Couchbase and load it into a relational data warehouse, where the traditional SQL-powered query and visualization tools can have at the data. (The NoSQL databases are also offering ODBC drivers, but that approach isn’t panning out, Carr says.)
There are several problems with the ETL approach, not the least of which is the time and expense of building and maintaining the data transformation pipelines. And there’s also the pesky little matter of flattening the JSON data. JSON data typically consists of rich nested structures, which are essentially ruined when the data is hammered into the relational form that BI tools expect.
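To make the flattening problem concrete, here is a minimal Python sketch. The order document and its field names are invented for illustration and do not come from SlamData or any customer; the point is only to show what happens when a nested document is forced into flat rows.

```python
# Hypothetical order document; the field names are invented for illustration.
order = {
    "order_id": 1,
    "customer": {"name": "Ada", "city": "London"},
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}


def flatten(doc):
    """Turn one nested document into flat rows, one row per array element."""
    base = {
        "order_id": doc["order_id"],
        "customer_name": doc["customer"]["name"],
        "customer_city": doc["customer"]["city"],
    }
    # Each element of the nested array becomes its own row, and the
    # parent fields are copied onto every one of those rows.
    return [
        {**base, "item_sku": item["sku"], "item_qty": item["qty"]}
        for item in doc["items"]
    ]


rows = flatten(order)
for row in rows:
    print(row)
```

One order becomes two rows, with the customer fields duplicated on each. A naive row count no longer counts orders, and the fact that the items belonged to a single array has to be reconstructed from the repeated order_id.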
Carr and De Goes looked at the problem and realized that the core impedance mismatch was essentially a math problem. The math behind SQL is relational algebra, Carr says, but that simply doesn’t translate to the JSON and NoSQL world.
“Relational algebra works quite well, except that one of the baked-in things is your data is always flat and it’s always in two dimensions and it has a fixed schema,” Carr tells Datanami. “Of course, with JSON it doesn’t have any of those things, ever.”
Developers love NoSQL databases because they don’t enforce a fixed schema. “But with every analytic tool you can name, when you write a query the very first thing it does is goes to the database and says, ‘Tell me your schema, Mr. Database, and it needs to be fixed,'” Carr says.
Carr and De Goes considered these impedance mismatches, and said to each other, “Well, how do we solve the problem?” Carr says. “And we basically built something called multi-dimensional relational algebra.”
MRA, pronounced “Murray,” essentially extends the SQL dialect so that it can understand data types like JSON that can exist in more than two dimensions.
“Rather than blow everything up and start over, we said, ‘Let’s take what’s good in relational algebra and extend it,'” says Carr, the CEO of SlamData. “So we mathematically extended the capabilities in relational algebra, and basically made it so it can think about data in more than two dimensions. If you think about the way relational algebra works, it allows you to do a set amount of functions in two dimensions. In MRA, you can actually take any of the same functions and lift them into multi-dimensions.”
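The idea of "lifting" a function can be illustrated loosely in Python. This is not MRA's actual formalism, which is defined in the published math; it is only a sketch, with a hypothetical helper named lift, of how a function written for single values can be applied at any depth of a nested structure.

```python
def lift(fn):
    """Return a version of fn that applies to every leaf of a nested value."""
    def lifted(value):
        if isinstance(value, list):
            # Recurse into each element of an array.
            return [lifted(v) for v in value]
        if isinstance(value, dict):
            # Recurse into each field of a document, keeping the keys.
            return {k: lifted(v) for k, v in value.items()}
        # A plain value: apply the original function.
        return fn(value)
    return lifted


double = lift(lambda x: x * 2)

flat = [1, 2, 3]
nested = {"a": 1, "b": [2, [3, 4]], "c": {"d": 5}}

print(double(flat))
print(double(nested))
```

The same function works on the flat list and on the nested document, and the result keeps the shape of its input rather than being flattened first.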
MRA is open source, and the math behind it is published on GitHub. Carr says it’s survived the scrutiny of other mathematicians and computer scientists, and is on its way to being accepted by the mainstream. The technology forms the core of SlamData’s software, which compiles SQL-like queries generated by SlamData’s visual interface to run natively on each supported NoSQL database. Currently SlamData supports MongoDB, Couchbase, MarkLogic, and Spark for HDFS, while support for ElasticSearch is on the way.
“We present everything through a visual interface, but under the covers it’s actually using a SQL dialect that’s been modified to work for JSON or semi-structured data, nested documents, arrays, things of that nature,” Carr says. “You can do complex aggregations and joins. It’s much more than counting things or DevOpsy kinds of things.”
The magic of SlamData is that it exposes SQL-style BI analytics, but it doesn’t change the data. “We’re not trying to flatten it out or virtualize it or otherwise change it,” Carr says. “That’s where a lot of the problems arise with traditional legacy solutions. But we do expose it in a way where, if you’re somebody who’s comfortable with SQL, you can use SlamData directly on JSON sitting in Mongo or Couchbase or MarkLogic or Spark and use it as efficiently as if you’re using it on Postgres or SQL Server.”
SlamData compiles the SQL-like functions generated by its visual analytic interface to run on each database’s native query engine, so this approach scales just as well as the underlying database does. The compiler was written in Scala, so it’s very fast, and adds just a tiny amount of overhead to the operation.
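As a rough illustration of what compiling to a native query engine can look like, the Python sketch below turns a small grouped-sum request into a MongoDB aggregation pipeline. It is not SlamData's compiler: the function to_mongo_pipeline, the collection layout, and the SQL-like text in the comment are all assumptions made for this example. Only the $unwind, $group, and $sum operators are real MongoDB aggregation features.

```python
def to_mongo_pipeline(group_by, sum_field, unwind=None):
    """Build a MongoDB aggregation pipeline for a grouped sum.

    Hypothetical helper for illustration only; a real compiler would
    parse a query and handle far more than this single query shape.
    """
    pipeline = []
    if unwind is not None:
        # $unwind emits one document per element of the named array.
        pipeline.append({"$unwind": "$" + unwind})
    pipeline.append({
        "$group": {
            "_id": "$" + group_by,                # grouping key
            "total": {"$sum": "$" + sum_field},   # aggregate per group
        }
    })
    return pipeline


# Roughly: SELECT items.sku, SUM(items.qty) FROM orders GROUP BY items.sku
# (the SQL text is an assumed dialect, shown only for orientation)
pipeline = to_mongo_pipeline(
    group_by="items.sku", sum_field="items.qty", unwind="items"
)
print(pipeline)
```

Because the output is an ordinary pipeline, the database itself does the unwinding and grouping where the data lives, which is why an approach like this can scale with the underlying engine instead of with an external ETL job.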
It took SlamData two years to develop the software, so it’s only been selling the product for a year and a half. It currently has about 40 customers, including Cisco, which was having difficulty analyzing complex data sitting in arrays in a MongoDB database. SlamData was able to deliver a complex dashboard for Cisco in just two days, Carr says.
The problems that NoSQL customers are having with the driver-based approach to accessing data from BI tools are helping drive sales at SlamData, Carr says.
“All the NoSQL guys started down the driver path four to five years ago and it’s just been a universal disaster,” he says. “We talked to companies every day who say ‘We tried to use the ODBC driver from XYZ company and it has no ability to understand the complex data types that JSON supports.’ That’s been a common theme. If that problem was solved, and drivers worked, we wouldn’t exist.”
While the NoSQL database market is undergoing some consolidation (Basho just called it quits), there’s no sign that the technology is slowing down. MongoDB, Couchbase, MarkLogic, and DataStax are all fighting for market share and the hearts and minds of developers. As the Web, mobile, and IoT applications built on these NoSQL backends grow, it’s natural that customers want to perform some analytics on them. That provides the growth potential for companies like SlamData.
“It’s actually something that’s going to be required as we move forward in modern data,” Carr says. “I think the days of purely relational algebra are kind of numbered. I think we need a more powerful algebra and I think we’ve delivered that.”