Graph Analytics Poised to Solve Tough Big Data Problems
Hadoop has emerged as the go-to platform for sifting through massive amounts of data on commodity machines. But when it comes to certain types of analytic workloads with open-ended problems, nothing beats a graph database, which may or may not run on Hadoop. The product category is maturing quickly and is well positioned to make major inroads in the field of analytics over the next few years.
There’s no denying that graphs are hot. “Graph analysis,” analyst firm Gartner says, “is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
There are a number of commercial and open source graph databases available on the market, and perhaps half of them run on Hadoop or make use of HBase. On the open source front, there's Titan, a distributed transactional graph database, and Apache Giraph, which uses Hadoop and MapReduce to process data. Apache Spark includes a graph processing library, called GraphX, that comes preloaded with common graph algorithms.
On the commercial front, there's GraphLab, which recently won third place in the Strata + Hadoop World Startup Showcase. Cray's YarcData appliance ships with a graph database that has been used to crunch Major League Baseball data. Then there's Sparksee (formerly DEX), which was developed at the Polytechnic University of Catalonia and is being commercialized by Sparsity-Technologies. A number of hybrid NoSQL databases also offer graph-like capabilities as a feature, including MarkLogic, ArangoDB, and Sqrrl.
But the undisputed giant in the graph field is Neo Technology, which got an early start in the field and today touts more than 50 Global 2000 customers for its Neo4j product, including eBay, Pitney Bowes, and Cisco. Two weeks ago, the company drew 700 people to its annual GraphConnect show in San Francisco.
Neo CEO Emil Eifrem says the company has experienced tremendous growth in the U.S. and international markets. “Our strong momentum, and record turnout for GraphConnect 2014, speaks to the explosive demand for Neo4j as the use cases for graph databases continues to broaden and accelerate,” he says.
Neo has the critical mass necessary to attract big companies, like Wal-Mart, which started using Neo4j in 2013 to generate personalized product recommendations on its Brazilian e-commerce website. Wal-Mart found that Neo4j could easily surface commonalities among multiple data sets, including customers' past purchases and data from current Web sessions, a task the company had struggled to accomplish using relational technologies.
“With Neo4j, we could substitute a heavy batch process with a simple and real-time graph database,” says Wal-Mart software developer Marcos Wada in a Neo case study. “It suits our needs very well.”
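The article doesn't detail Wal-Mart's actual data model, but the basic idea of graph-based recommendation, traversing from a customer through shared purchases to other customers' products, can be sketched in plain Python. All names and data here are invented for illustration; a production system would express this as a traversal query against the graph database rather than an in-memory scan.

```python
from collections import Counter

# Hypothetical purchase graph: customer -> set of purchased products.
# Invented sample data, purely for illustration.
purchases = {
    "alice": {"camera", "tripod", "sd_card"},
    "bob":   {"camera", "sd_card", "camera_bag"},
    "carol": {"camera", "tripod", "lens"},
}

def recommend(customer, purchases, top_n=3):
    """Two-hop graph traversal: customer -> product -> other customer
    -> that customer's other products, weighted by shared purchases."""
    mine = purchases[customer]
    scores = Counter()
    for other, theirs in purchases.items():
        if other == customer:
            continue
        overlap = len(mine & theirs)       # shared purchases = edge weight
        for product in theirs - mine:      # products the customer lacks
            scores[product] += overlap
    return [product for product, _ in scores.most_common(top_n)]

print(recommend("alice", purchases))
```

The appeal of doing this in a graph database rather than a relational one is that each hop is an edge traversal instead of a join, so the query stays fast and simple even as the data sets multiply.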
There are plenty of upstarts looking to knock Neo off its perch at the top of the graph heap, including SPARQL City. The two-year-old San Diego software company was founded by analytics database pioneer Barry Zane and has strong backing from Actian (formerly Ingres), which has a financial stake in the company.
Actian this week will announce that it's now offering SPARQL City's graph database engine, called SPARQLverse, as part of its Hadoop-based analytics stack. This allows customers to run SPARQLverse's in-memory graph analytics capabilities against the same HDFS-resident data that they can analyze using other Actian products, like Vector, the column-oriented analytics engine formerly known as Vectorwise, which the company recently ported to run in Hadoop.
Zane, who co-founded ParAccel (which Actian acquired in 2013 and now calls Matrix) and Netezza (which IBM acquired in 2010), sees graph databases taking off. "There are quite a number of graph products out there and quite a number of people doing graph analytics with other tools that weren't necessarily designed with graph in mind," he tells Datanami. "But I think there was a sea change about a year ago when the W3C ratified the SPARQL 1.1 and RDF specifications."
Those events, Zane says, played right into SPARQL City’s strong suit, which is using the SQL-like SPARQL language to write graph queries, and using the Resource Description Framework (RDF) as the standard data format for storing semantic triples. “We see that the whole world of graph analytics is growing very fast,” he says. “And specifically the subworld of that, that’s geared around RDF and SPARQL, is growing faster.”
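To make the RDF/SPARQL model concrete: RDF stores facts as subject-predicate-object triples, and a SPARQL query matches patterns with variables against those triples. The toy matcher below is a pure-Python sketch of that idea, not how SPARQLverse or any real engine works; production engines index and distribute the triples rather than scanning a list.

```python
# RDF-style semantic triples: (subject, predicate, object).
# Invented sample data for illustration.
triples = [
    ("alice", "knows",    "bob"),
    ("bob",   "knows",    "carol"),
    ("bob",   "worksFor", "acme"),
    ("carol", "worksFor", "acme"),
]

def match(pattern, triples):
    """Match one (s, p, o) pattern against the triples. Terms starting
    with '?' are variables; everything else must match exactly."""
    results = []
    for triple in triples:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break
        else:
            results.append(binding)
    return results

# Rough analogue of:  SELECT ?who WHERE { ?who worksFor acme }
print(match(("?who", "worksFor", "acme"), triples))
# [{'?who': 'bob'}, {'?who': 'carol'}]
```

A full SPARQL engine joins many such patterns together, which is exactly the kind of multi-hop traversal work graph databases are built for.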
The company is gaining traction and experience primarily in a handful of market areas, including financial services, pharmaceutical, retail, and the government’s intelligence community.
“The kinds of problems customers are trying to solve and apply this technology for are pretty interesting,” says Vishal Daga, vice president of marketing for SPARQL City. “Like compliance in financial services, being able to do insider trading analysis but at a whole different level–not just looking at trade execution data but marrying it with geospatial and social data.”
In the pharmaceutical industry, the company is helping to speed drug development by loading data about drugs into the graph alongside data about proteins and chemical pathways. “Being able to analyze all of that together, executing link analysis to give you better candidates for drug development,” Daga says.
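The link analysis Daga describes amounts to walking drug-to-protein and protein-to-pathway edges and ranking drugs by how strongly they connect to a disease. The sketch below illustrates that scoring idea with invented names; it is not real pharmaceutical data or SPARQL City's algorithm.

```python
from collections import Counter

# Hypothetical drug -> protein-target edges (invented for illustration).
targets = {
    "drug_a": {"p53", "egfr"},
    "drug_b": {"egfr", "braf"},
    "drug_c": {"braf"},
}
# Proteins implicated in a hypothetical disease pathway.
disease_proteins = {"egfr", "braf"}

def rank_candidates(targets, disease_proteins):
    """Score each drug by how many of its protein targets lie on the
    disease pathway (counting drug -> protein -> disease links)."""
    scores = Counter({drug: len(proteins & disease_proteins)
                      for drug, proteins in targets.items()})
    return [drug for drug, score in scores.most_common() if score > 0]

print(rank_candidates(targets, disease_proteins))
```

Here drug_b ranks first because both of its targets sit on the disease pathway; in a real graph the scoring would also weigh pathway data, chemical similarity, and other linked data sets.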
Zane says the rapid adoption of RDF as a standard data format surprised him and that it bodes well for graph databases in general. In the pharmaceutical business, several consortiums have emerged to expose massive amounts of data to researchers in the RDF format, including Open PHACTS and the National Institutes of Health's PubMed. In the financial services sector, the Financial Industry Business Ontology (FIBO) promises to boost transparency of complicated assets and transactions among banks.
SPARQL City recently ran some benchmark tests and claims its software returned results more quickly than competitors' while handling larger data sets. Zane says the scalability of the product comes from lessons learned building the Netezza and ParAccel products. "We're a massively parallel product," he says. "We can scale certainly into the hundreds of computers, whereas traditionally in the graph world, the architectures have really been focused around a single computer."
SPARQLverse will currently scale to about 1,024 nodes, enough to house perhaps hundreds of billions if not trillions of triples, or edges in a graph. That’s nearly Facebook scale. As a general purpose graph analytic database, that should be plenty to help customers find non-intuitive relationships in data in ways that are next to impossible using traditional MapReduce and relational SQL approaches.