Neo4j Drives Simplicity with Graph Data Science Refresh
Graph data science is an emerging field with a lot of promise, but it’s been hamstrung by the need for practitioners to have deep data engineering and ETL skills. Now Neo4j is hoping to take that complexity out of the equation with the general availability of Aura Data Science, its first cloud-based graph data science offering. The company is also launching Graph Data Science 2.0, which brings additional simplification.
It’s been two years since Neo4j launched the first release of Graph Data Science (GDS), the company’s initial foray into the field. GDS was essentially a plugin for Neo4j’s property graph database that let users run machine learning algorithms atop connected data stored in the database, create graph embeddings, and generate insights from the graph.
While early adopters liked the graph data science capabilities exposed in GDS, many of them felt flummoxed by all the extra data work that surrounded it, says Alicia Frames, Neo4j’s product manager for graph data science.
“One major barrier to adoption that we’ve seen has been data scientists really struggling with what do you mean deploy a database? What are all these monitoring things? I don’t understand? I don’t do this,” Frames says. “Data scientists are not database administrators. They’re not software developers. They’re not machine learning engineers.”
Neo4j has strived to eliminate much of that complexity with AuraDS, a cloud-based version of GDS. It is being offered first on Google Cloud, with other cloud platforms to follow, Neo4j says.
When users log into AuraDS, they’re presented with a GUI console that walks them through database setup, Frames says. They’re asked how many nodes and relationships (or vertices and edges) they have, and what types of data science tasks they might want to run, such as graph algorithms, embeddings, or machine learning. The product then suggests an appropriately sized database, which the user can accept or adjust.
Once the database is set up, the offering walks the user through the next step: importing data. There is a Spark connector for importing data from a data warehouse, a Kafka connector for pulling in streaming data, and another connector for pulling in data from BI environments, Frames says.
Once the cluster is set up and the data is starting to load, then the data scientist is free to start experimenting with data. “We’ve really tried to reduce the friction,” Frames says. “It’s just press a button to create your instance. Press another button for import. And then you can focus on value.”
A user typically needs some a priori idea of how their data will map to a graph database, Frames says. Nodes are typically nouns, while relationships are the verbs, she says. But users don’t need to know exactly how their data maps to a graph, because the software guides them through the more complex data transformations that can frustrate less experienced graph data scientists.
“Let’s say you have a knowledge graph that’s everything your company knows, but you don’t know what’s going to be relevant for your data science project,” Frames tells Datanami. “Our analytics workspace lets you flexibly reshape that so you can say ‘Okay, out of my kitchen sink everything I’ve got, what I want to load into memory are people and items, because I want to do recommendations, but I want to collapse their relationship so there’s just one relationship between each person and item, and I want it to be the weight of that relationship is the sum of all those individual relationships.’
“So the data science platform gives you a lot of capability to create graph from non-graph data,” she continues, “or to transform a general purpose graph into this special purpose graph for your project.”
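The reshaping Frames describes, collapsing many parallel person-to-item relationships into a single weighted one, can be sketched in plain Python. This is purely illustrative (the data and function names are made up for this example; AuraDS performs this inside its analytics workspace):

```python
from collections import defaultdict

# Raw "kitchen sink" graph: one edge per individual interaction,
# e.g. (person, item, weight) triples from views, clicks, purchases.
interactions = [
    ("alice", "book", 1.0),
    ("alice", "book", 2.0),   # a second interaction with the same item
    ("alice", "lamp", 1.0),
    ("bob",   "book", 3.0),
]

def collapse_edges(edges):
    """Collapse parallel edges so each (person, item) pair keeps a
    single relationship whose weight is the sum of the originals."""
    weights = defaultdict(float)
    for person, item, w in edges:
        weights[(person, item)] += w
    return dict(weights)

collapsed = collapse_edges(interactions)
print(collapsed)
# {('alice', 'book'): 3.0, ('alice', 'lamp'): 1.0, ('bob', 'book'): 3.0}
```

The same idea scales up in GDS: project only the node and relationship types the project needs, aggregating parallel relationships into one weighted edge per pair.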
AuraDS is based on GDS 2.0, which is also being launched today. GDS 2.0 introduces a host of new features, including a new Python client, which will probably interest data scientists most. Another important addition is the new data pipeline catalog and syntax, which simplify how models are configured, trained, and deployed.
For example, say a user wants to create a graph model to predict the likelihood of fraud in bank transactions. She would start by typing GDS.create.linkprediction.pipeline, then specify which algorithms and data features she wants to use, Frames says.
“I want to use a graph embedding. I want to use PageRank. I want to use the person’s age and their bank account balance to make that prediction,” she says. “And then they can specify, how do I want to measure how good this model is? I want to use area under the precision recall curve. And then it says, which techniques do I want to use? Logistic regression? Random forest? And then they basically write ‘model.training,’ and we iterate through all of those features they’ve supplied, possible ways to combine those features, the models they specified, and the range of hyperparameters for those models to then find the best performing model and save that for the user. And then they can apply it.”
Without a data science platform like GDS, that process would take many more steps: pulling data out of a database into a dataframe; reshaping the data for your choice of data science platform; selecting features; merging the results back with the dataframe; conducting the training manually; writing more code to explore the hyperparameter space; and then integrating with the database again for inference.
“So it’s really about reducing friction,” she says. “That’s a major theme that we’ve been building on, is how do you make it easier and easier and more foolproof to come up with models to predict graph native machine learning? How is this structure of my graph going to change? In this release, pipelines are first and foremost a way of saying ‘These are all the steps I want to do. Assemble them for me and come up with the best result.’”
GDS is beginning to look like an AutoML platform, which automates many of the steps in the data scientist’s workflow, but one designed for the graph data scientist. A future release will focus on auto-tuning, Frames says.
“We’re very much focusing on supporting that lifecycle from proof of concept,” she says. “It should be really simple for me to get my data and find value quickly, all the way through to production, which is, hey, I’m trying to build this model and it’s good. I want to be able to persist it to my database and publish it and share it with my team. And Neo4j can support MLOps around managing multiple models and applying those models to incoming data.”
This release also brings better integration with transactional databases, and the capability to pull data into the graph database, analyze it with graph data science techniques, and then store the results in the graph cluster, Frames says.
“What we’ve said is here’s an automated way you can connect a read replica to run data science,” she says. “We’ll do server-side routing internally. You’ll store those results back. Making it so an end user doesn’t have to pick between transactional and analytical. They can say I have the right architecture for the right problem.”
The world of graph data science is full of promise, and Neo4j is hoping to ride that wave of adoption to success with GDS 2.0 and AuraDS. The company is the most well-established graph database vendor in the market, and now it’s looking to leverage that experience in developing new data science use cases, which account for about 20% of new uses at Neo4j, Frames says.
“Fingers crossed that AuraDS is a big step for us in overcoming” the friction, she says. “Knowing how to drive a car is not the same as being a mechanic. Knowing how to do graph data science is not the same as being a DBA. Up until this point, you really did have to know both. So we’re hoping it really unlocks a lot of that.”