This Data Lake Runs On Graph
When some people hear the words “data lake,” they assume that it must run on Apache Hadoop or maybe Amazon S3. But the folks at Cambridge Semantics are finding traction in certain industries with a data lake solution that runs atop a graph database.
Cambridge Semantics was founded in 2007 by a group of technologists from IBM’s Advanced Technology Internet Group. Under the guidance of founder and CTO Sean Martin, the company developed solutions based in part on Big Blue’s research into semantic technologies and knowledge management solutions.
Things changed quite a bit for Cambridge in 2016, when the Boston, Massachusetts company acquired SPARQL City, the San Diego developer of a parallel graph query engine that uses SPARQL, a language used for querying data stored in Resource Description Framework (RDF) format.
Founded by Barry Zane, of ParAccel and Netezza fame, SPARQL City provided Cambridge Semantics with the parallel processing chops to go along with its sophisticated semantic tools. Those two forces came together in the recently released Anzo Smart Data Lake (ASDL) version 4.0, which Cambridge unveiled at the recent Strata Data Conference in New York City.
Martin says a confluence of events has created a golden opportunity for graph and semantic technologies to have a big impact on the market for big data analytic solutions. “It just hasn’t been viable until 24 months ago,” he tells Datanami. “We could do these pretty miniature things with 10 to 20 million facts, but now we can deal up into the hundreds of millions of triples in fact.”
Triples, of course, are the subject-predicate-object statements that are stored in Cambridge’s graph database in RDF format and queried through its Graph Query Engine (GQE). Because the relationships between entities are already recorded explicitly in the graph, it’s a much more refined and efficient way to access previously recorded knowledge than, say, writing SQL queries to join dozens of tables stored in a relational database.
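To make the idea concrete, here is a minimal sketch of triples and pattern matching in plain Python. It is an illustration only, not Cambridge's engine or its API: the identifiers (`drug:A`, `madeBy`, `runsAt` and so on) and the helper functions are invented for this example, and a real deployment would store RDF and query it with SPARQL.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical facts; all names here are made up for illustration.
TRIPLES: set[Triple] = {
    ("drug:A", "madeBy", "company:X"),
    ("drug:A", "treats", "condition:asthma"),
    ("trial:1", "tests", "drug:A"),
    ("trial:1", "runsAt", "hospital:H1"),
}


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for t in TRIPLES:
        if (s is None or t[0] == s) and (p is None or t[1] == p) \
                and (o is None or t[2] == o):
            yield t


def hospitals_testing_drugs_by(company: str) -> set[str]:
    """Follow company -> drug -> trial -> hospital by walking stored links."""
    drugs = {subj for subj, _, _ in match(p="madeBy", o=company)}
    trials = {subj for subj, _, obj in match(p="tests") if obj in drugs}
    return {obj for subj, _, obj in match(p="runsAt") if subj in trials}
```

Each hop in `hospitals_testing_drugs_by` follows a relationship that is already stored as a fact, which is the role a multi-table join plays in a relational schema.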
Cambridge recently conducted a test to showcase the speed advantages over relational databases, in this case Oracle Exadata. The company clocked how long it took to load a trillion triples and then execute a string of queries on the two databases. It took two hours on Cambridge’s ASDL solution, while it took about a week to complete on Exadata, Martin says.
Technical improvements in hardware and software are partly responsible for graph databases becoming more viable solutions to some types of analytical problems, Martin says.
“I think the breakthrough was simply the horsepower of the machines made a lot of the things people wanted to do with graph to be more viable,” he says. “Certainly in the case of GQE, it’s the interconnect and the cheapness of the RAM: you can fit everything in RAM now. There have been some technical breakthroughs at the software level, too.”
The market is also more amenable to hearing the graph database song at this point — particularly for firms in the pharmaceutical and financial services industries who have been struggling to use relational data warehousing technology to answer ever-changing questions.
“We generally do not prepare the data in a way to answer questions for companies,” Martin says. “What we do is we basically build a description of the data and we load the data into that description. We call those models — the description or ontology. Now you can ask any question you like on any set of dimensions, [perform] as many joins as you like, against that data.”
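One way to picture "loading the data into a description" is a small mapping from source columns to graph predicates. The sketch below is a guess at the general shape, not Anzo's actual model format: `HOSPITAL_MODEL`, its keys, and `to_triples` are hypothetical names invented for illustration.

```python
# Hypothetical model: describes an entity type and how source columns
# map onto predicates. Not the real Anzo ontology format.
HOSPITAL_MODEL = {
    "type": "Hospital",
    "id_column": "hospital_id",
    "properties": {"city": "locatedIn", "specialty": "hasSpecialty"},
}


def to_triples(row: dict, model: dict) -> list[tuple[str, str, str]]:
    """Turn one source record into triples according to the model."""
    subject = f"{model['type']}:{row[model['id_column']]}"
    triples = [(subject, "rdf:type", model["type"])]
    for column, predicate in model["properties"].items():
        if row.get(column) is not None:  # skip missing values
            triples.append((subject, predicate, str(row[column])))
    return triples


row = {"hospital_id": "H1", "city": "Boston", "specialty": "oncology"}
triples = to_triples(row, HOSPITAL_MODEL)
```

Because the record ends up as independent facts rather than a fixed table layout, adding a new property later means adding a mapping entry, not altering a schema.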
That results in a profound change in the type of questions customers can ask. For example, in the pharmaceutical industry, a customer could load hundreds of different entity types into ASDL, including descriptions of drugs, lists of drug makers, lists of publications and patents, competitors, and information about hospitals.
Once that data is loaded, a customer could use the ASDL interface to semantically ask a question, such as “Where is a hospital where I can run a trial that my competitors aren’t running, and a hospital that has patients of this kind and knows how to administer drugs of this kind. Where can I find that?”
“That’s a complex question, and I may ask different questions tomorrow,” Martin says. “Imagine writing the SQL for that question. You have to change the schema, get more data loaded, get the DBA to write the code, fiddle away with it for a while. What I just described is something an end user can do for themselves.”
Pharmaceutical companies and financial services firms are the two biggest customers of Cambridge Semantics’ offerings, followed by governmental agencies and retailers. The biggest problem facing banks and other financial services firms is they never know what the regulator is going to ask them next.
“One month the regulator says they want to know about this, next month they show up with a different set of questions,” Martin says. “It’s a complex data landscape that is very difficult to just rejigger them and ask a new set of questions, so they’re looking for new ways to get people flexible and quick answers to questions.”
The company is introducing several new features with ASDL 4.0, including:
- new ETL processes for ingesting data into the lake;
- a new data catalog that makes it easier for users to browse and discover new data sets;
- new Graphmarts designed to make it easier for users to share and discover data;
- new data layers that work with Graphmarts to provide cleansing, transformation, and access control;
- and a new Hi-Res Analytics offering designed to let users get answers to ad-hoc questions in real time.
According to Martin, the benefits of asking more questions of bigger data faster and more easily extend to multiple parts of the organization.
“There’s just a massive amount of efficiency that’s good for everybody,” he says. “I think customers have been kicking the tires for years in disappointment because it held such promise. But it’s only now that we can actually say we’re at parity with the traditional technologies, which were not semantic and not graph.”