Is GOLAP the Next Wave for Big Data Warehousing?
The 1990s and 2000s saw the rise of relational databases for transaction processing (OLTP) as well as analytical processing (OLAP). As the volume and variety of data explode in the 2010s, database experts are looking to parallel graph analytic databases, which some are calling GOLAP data warehouses, to enable the timely extraction of insights.
The biggest proponent of the GOLAP data warehouse concept at this point is Cambridge Semantics, the company behind the Anzo data lake offering. The Boston, Massachusetts company’s GOLAP offering includes a parallel graph database, called AnzoGraph, along with a collection of tools that help to automate core functions, like data modeling, transformation, and query-building (the rest of the Anzo Smart Data Lake suite).
Barry Zane, the vice president of engineering for Cambridge Semantics (via its SPARQL City acquisition), knows a thing or two about developing parallel databases. After all, Zane was the founder and CTO of ParAccel, the massively parallel processing (MPP) column-oriented relational database that was bought by Actian and forms the basis for Amazon Redshift, and a founder and the VP of technology for Netezza, the MPP column-oriented database acquired by IBM.
As Zane explains, parallelism was the critical architectural breakthrough that enabled MPP database users to get answers to SQL queries within sub-second or several-second timeframes. “When we run on a cluster, which is our typical deployment, every query is able to leverage every core on the cluster,” he tells Datanami. “You can’t wait overnight for your analytic queries. You need interactive online capabilities.”
Just as parallel MPP databases enabled massive OLAP processing for data that mostly originated in row-oriented transaction-oriented databases, the advent of graph OLAP warehouses, or GOLAP warehouses, now is bringing parallelism to bear on data originating in transaction-oriented (sometimes called operational) graph databases, such as Neo4j, Titan, or Amazon Neptune, Zane says.
“We can say that a parallel graph database is a logical evolution of the notion of doing parallel analytics on data,” he explains. “Obviously, if you’re a graph guy, you know that graph is just a much better representation of real-world data than trying to squish it through the schema in a relational system.”
By adding parallel processing to a graph database, which has inherent architectural advantages because of the way it stores linked data, GOLAP can deliver super-charged analytic capabilities, Zane maintains.
“So you have all the benefits of graph, and now you have the ability to do large-scale analytics against it,” he continues. “That’s been one of the biggest problems in the graph world, since graph began. Building these OLTP systems that can collect the data and do simple analytics on small points of the data – that’s been pretty straightforward and that proves the value of graph. But the value of being able to do large-scale analytics really required a parallel system, parallel databases.”
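The scatter-and-gather pattern behind that kind of parallel analytics can be illustrated with a minimal Python sketch. It is not AnzoGraph's implementation: the edge list, the function names, and the use of threads as stand-ins for cluster cores are all assumptions made for illustration. Each worker counts outgoing edges in its own slice of the graph, and the partial counts are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical edge list of (source, target) pairs; not data from the article.
EDGES = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("dave", "alice"), ("erin", "alice"),
]

def partition(edges, n_parts):
    """Split the edge list into n_parts slices, round-robin."""
    return [edges[i::n_parts] for i in range(n_parts)]

def partial_out_degree(edge_slice):
    """Count outgoing edges per node within a single slice."""
    return Counter(src for src, _ in edge_slice)

def parallel_out_degree(edges, n_workers=3):
    """Scatter slices to workers, then gather and merge the partial counts."""
    parts = partition(edges, n_workers)
    # Threads stand in for the cores of a cluster in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_out_degree, parts))
    total = Counter()
    for counts in partials:
        total.update(counts)  # Counter.update adds counts together
    return dict(total)

if __name__ == "__main__":
    print(parallel_out_degree(EDGES))
```

Because every slice is processed independently, adding workers shortens the scan without changing the merged answer, which is the property an MPP engine relies on.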
The main alternative to the GOLAP approach, at least for connected data that’s already stored in a graph database, involves shunting data off the graph database and into a Hadoop cluster to run batch analytics, Zane says.
“But at the very least, that’s an extremely time-consuming, complex, and painful process,” the San Diego-based technologist says. “It’s simply more natural if you have a graph database that has graph data to be able to tap that data from another graph database, so you don’t go through weird data transformation and losses and complexities. You’re able to do analytics on graph data as graph data.”
That viewpoint, of course, presupposes that the data exists in a graph format in an OLTP graph database to begin with, which surely is not always the case. “Without a doubt most of the data in the world today is not in OLTP graph databases,” Zane says. “Most of the data in the world is currently in relational databases.” (That is clearly a reference to structured data, since unstructured data sources, like video footage, vastly outweigh structured sources.)
Cambridge Semantics supports queries written in SPARQL, a Resource Description Framework (RDF) query language, as well as SQL. But having analysts sit around writing SQL and SPARQL queries is not the only way to get insights out of the Anzo data lake.
Just as today’s Teradata and Redshift data warehouses hold data that companies want to use to train machine learning models, tomorrow’s GOLAP data warehouse can also serve a critical role in preparing and storing the data that will be used to train the next generation of machine learning models.
“The sheer power of AnzoGraph allows you to do interactive de-normalization of that rich model of harmonized data, where you have many different data sources,” says Cambridge Semantics CTO and co-founder Sean Martin. “That de-normalization is a key part of the data scientists’ job when they’re setting up to train a model.”
The Anzo suite includes a query generator that makes writing those de-normalizing queries much easier, Martin says. “From the point of view of data preparation and feature engineering, we believe we have a killer app in terms of speeding up the ability of data scientists and data engineers to actually get data into a de-normalized form, which becomes the training data,” he says. “It’s taking what could take days or weeks and getting results in hours.”
One of Cambridge Semantics’ customers is a pharmaceutical company that must comply with FDA regulations that require it to quickly address any issues or problems with its drugs that it becomes aware of. The company must ingest a huge onslaught of reports that come in a large variety of formats – text, faxes, emails, social media – and deal with them as quickly as possible.
“Now they’re also starting to use it with machine learning, together with the graph database, to help organize that data as it streams in,” Martin says. “That’s an example of a use case that they wouldn’t have dreamed of tackling four to five years ago when it would have required a huge manual process. That’s now becoming commonplace. We’re seeing so many use cases that involve unstructured text-oriented selects as one of the data source types.”
While machine learning algorithms want input data presented in a tabular, de-normalized form, real-world data doesn’t originate that way. The big advantage of a graph database is that it doesn’t require the user to hammer the data down into its basic elements just to store it. There are ways in Anzo to flatten the data for export to machine learning systems, but users aren’t forced to forgo that richness up front.
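To show what that flattening step can look like, here is a minimal Python sketch that pivots subject-predicate-object triples into one row per subject. It does not use Anzo's tooling or query generator; the triples, property names, and the `flatten` helper are hypothetical, loosely modeled on the adverse-event reports described above.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); not real report data.
TRIPLES = [
    ("report:1", "source", "email"),
    ("report:1", "drug", "drugA"),
    ("report:1", "severity", "high"),
    ("report:2", "source", "fax"),
    ("report:2", "drug", "drugB"),
    ("report:3", "source", "social"),
    ("report:3", "severity", "low"),
    ("report:3", "reporter_age", "54"),
]

def flatten(triples, missing=None):
    """Pivot triples into one row per subject with one column per predicate.

    Assumes each predicate is single-valued per subject; a repeated
    (subject, predicate) pair keeps only the last object seen.
    """
    by_subject = defaultdict(dict)
    columns = []  # predicates in first-seen order
    for subject, predicate, obj in triples:
        by_subject[subject][predicate] = obj
        if predicate not in columns:
            columns.append(predicate)
    rows = [
        {"id": subject, **{c: props.get(c, missing) for c in columns}}
        for subject, props in by_subject.items()
    ]
    return columns, rows

if __name__ == "__main__":
    cols, rows = flatten(TRIPLES)
    print(["id"] + cols)
    for row in rows:
        print(row)
```

The graph side stores only the properties each report actually has, while the flattened table has to fill every gap with a placeholder. That is the trade-off the passage describes: the tabular form is produced on demand for training rather than imposed when the data is stored.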
Martin thinks that’s what will give GOLAP legs – the capability of the graph database to handle highly variable data.
“That’s why we think it’s the successor to the relational data warehouse, because we don’t have all the artifacts of tables and keys and all that stuff,” he says. “Instead we’re just looking at the data in a very logical, true-to-life representation of whatever data we could get. It removes all the crud.”
The market certainly seems primed to benefit from analytical graph databases. At this point, Cambridge Semantics appears to be the only vendor using the GOLAP term, which certainly is catchy. It will be interesting to see if that continues to be the case moving forward.