Neo4j Reaches Out to Spark as Platform Ambitions Grow
Neo4j is using its GraphConnect New York conference to showcase two major announcements this week, including a new initiative to support the design and execution of graph queries in the Apache Spark environment and the creation of a platform to build an ecosystem around its core graph database.
Neo4j is a leader in the development of graph databases that can support an array of big data tasks, such as building recommendation systems for retailers or detecting fraud for banks. Its graph technology is used to support hundreds of production applications around the world at Fortune 100 enterprises like Wal-Mart, Cisco, eBay, Comcast, and Lufthansa, not to mention NASA.
The Silicon Valley company, which was previously known as Neo Technology, created a SQL-like language called Cypher to enable developers to build applications atop its property graph database. Cypher has since been open sourced by Neo4j, and is being supported by other vendors, including SAP and Redis, as a standard language for developing graph systems.
Today Neo4j announced that the openCypher project has developed a way to let Cypher run directly atop Apache Spark. According to Philip Rathle, Neo4j's VP of product, Cypher for Apache Spark enables users to map their resilient distributed datasets (RDDs) into a graph view, and then to run Cypher queries over that graph view.
In addition to giving Cypher users access to the wealth of data that may be present in a Hadoop cluster (Hadoop clusters remain a popular place to run Spark), the new project also opens up the iterative Spark development process to graph-type queries, the Neo4j VP says.
“In the Spark world, typically you’ll run a query and then those results will get persisted in memory,” he says. “Then you run another query on top of that image, which is usually tabular, and you get another result. You sort of iterate through multiple steps, and it’s all held in memory. We’ve enabled that in Cypher so you can actually run a Cypher query, return data as another graph, then write another Cypher query that accesses the in-memory graph you just created.”
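A rough sketch of the workflow Rathle describes might look like the following. The labels, relationship types, and properties here are hypothetical, and the exact graph-construction syntax in Cypher for Apache Spark was still evolving at the time, so treat this as illustrative rather than definitive:

```cypher
// Step 1: query the graph view mapped from Spark data and derive a
// new in-memory graph of co-purchase relationships.
// (:Customer, :Product, BOUGHT, and ALSO_BOUGHT_WITH are made-up names.)
MATCH (c:Customer)-[:BOUGHT]->(p:Product)<-[:BOUGHT]-(other:Customer)
CONSTRUCT
  CREATE (c)-[:ALSO_BOUGHT_WITH]->(other)
RETURN GRAPH

// Step 2: a follow-up query runs against the graph produced in step 1,
// held in Spark memory, without re-reading the underlying RDDs.
MATCH (c:Customer)-[:ALSO_BOUGHT_WITH]->(other:Customer)
RETURN c.name, count(other) AS overlap
```

The point of the pattern is the second query: instead of joining tabular intermediate results, each iteration consumes the graph the previous step returned.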
Running these iterative graph queries will accelerate the graph development process, while allowing data scientists and others to leverage the advantages that graph processing holds when it comes to data locality and pattern recognition, Rathle says.
“The users we’ve introduced this to so far see this as a way to really accelerate the testing of their queries that they ultimately want to run as graph queries, operationally, in real time, rather than having to do lots of steps in bringing data together, do a bunch of things manually, and then run something else manually,” he says. “You can just do this pretty organically.”
Neo4j is releasing Cypher for Apache Spark under an Apache 2.0 license. The hope is it will eventually become a full-fledged member of the graph component of the Apache Spark project, Rathle says. “I expect it will fully work its way into the Spark community,” he says.
Spark’s graph component, called GraphX, is its own distributed graph execution system. The software does many of the things that another open source graph project, called Giraph, can do, Rathle says. However, GraphX is more complicated. “It requires a double PhD to know what to do with,” he says.
Cypher fits into Spark’s graph component because Cypher and GraphX solve very different problems and complement each other, Rathle says. “They’re totally non-overlapping,” he says. “We actually looked for just a short while at extending GraphX, putting Cypher on top of GraphX. But that doesn’t make any sense. And vice versa — you can’t run the kinds of things you can in GraphX in Cypher.”
On the platform front, Neo4j unveiled its Native Graph Platform. While the Cypher language and the Neo4j database provide core underpinnings for writing and running graph applications, a platform is a bigger thing. To that end, the Native Graph Platform provides ways for Neo4j and its growing partner ecosystem to plug in related products, including ETL, analytics, visualization, and discovery tools.
“It’s not a new thing that, in order to be successful with a database technology, there’s a lot of peripheral technology” required, Rathle says. “We’ve been working toward the vision where we have a rich product and feature set across a whole set of different areas in order to help all the people in different roles who are working with the database, to be able to do more with it.”
Neo4j is taking the lead on this platform vision, and is using GraphConnect to showcase some tools it’s working on. That includes the new Neo4j ETL product that will be able to pull data from Hadoop and other raw data sources; an RDBMS-to-graph connector that will let data move between relational databases and Neo4j; and a new desktop interface dubbed “Mission Control” for handling an array of Neo4j development and visualization tasks. These products are expected to become generally available in 2018.
Neo4j also unveiled a Spark adapter designed to make it easier for users to move data from Spark into the Neo4j database. This will be useful for those clients who want to operationalize their data science work done in Cypher for Apache Spark by running it on Neo4j’s graph database platform. “When you get to the point where you want to move this data from the Hadoop world into operational graph world, you can essentially push a button and it’s all yanked over, out of Spark memory, into Neo4j,” Rathle says.
The main integration points for partners to hook into Neo4j’s Native Graph Platform are the core APIs exposed by the Neo4j database and the new desktop app, Rathle says. “You can run any Neo4j app, like the Neo4j browser, and it’s also intended for third parties like our visualization tooling vendors to be able to plug their technologies in,” he continues. “The important thing for users is to be able to visualize graphs. We have a strong partner ecosystem around that.”