Beyond Titan: The Evolution of DataStax’s New Graph Database
DataStax’s 2015 acquisition of Aurelius–the company behind the TitanDB graph database–was a clear statement about the importance of graph databases to Cassandra customers. Today marks the official announcement of DataStax Enterprise Graph, which the company is counting on to solve a number of operational and analytic problems for its customers, even if the database doesn’t look much like TitanDB anymore.
When DataStax started incorporating the Aurelius development team and the TitanDB code in the spring of 2015, the DataStax development team wondered how much of the code they’d be able to reuse, says Martin Van Ryswyk, executive vice president of engineering at DataStax. “A year later, it was pretty much an entire rewrite,” he says.
According to Van Ryswyk, the rewrite should net DataStax customers performance and scalability gains compared to plain vanilla TitanDB, which continues as a separate project. The advantages stem from the way DataStax could now integrate the graph database with DataStax Enterprise (its commercial version of the open source Apache Cassandra database).
“Titan had on paper a very nice abstraction layer where you could have the processing engine decoupled from the storage methodologies,” Van Ryswyk explains. “So if it was HBase or Cassandra or DynamoDB, it would all work the same. Well, that’s great. But when you do that, you’re kind of limiting yourself to the least common denominator.
“We rewrote it,” he continues, “saying ‘We know that Cassandra is the storage engine under the covers, so how can we take advantage of that to push down filtering and querying, and do very Cassandra-specific things to get more performance than we could if we had to treat everything at an abstraction layer?’ We knew there was a lot of low-hanging fruit for making it better if you didn’t have to support five different systems.”
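To make the trade-off Van Ryswyk describes concrete, here is a minimal Python sketch. It is not DataStax or Titan code: the names `GenericStore`, `PushdownStore`, and `find_by_label` are hypothetical, and in-memory lists stand in for a real storage backend. The generic store exposes only a full scan, so the query layer has to read every row and filter afterward. The backend-aware store filters where the data lives and reads far fewer rows.

```python
from typing import Dict, Iterator, List

Vertex = Dict[str, str]


class GenericStore:
    """Hypothetical least-common-denominator backend: it can only scan everything."""

    def __init__(self, vertices: List[Vertex]) -> None:
        self._vertices = vertices
        self.rows_read = 0  # counts rows handed to the query layer

    def scan(self) -> Iterator[Vertex]:
        for vertex in self._vertices:
            self.rows_read += 1
            yield vertex


class PushdownStore(GenericStore):
    """Hypothetical backend-aware store: filters by label inside the storage layer."""

    def __init__(self, vertices: List[Vertex]) -> None:
        super().__init__(vertices)
        # Assumption for illustration: the backend keeps an index keyed by label.
        self._by_label: Dict[str, List[Vertex]] = {}
        for vertex in vertices:
            self._by_label.setdefault(vertex["label"], []).append(vertex)

    def scan_label(self, label: str) -> Iterator[Vertex]:
        for vertex in self._by_label.get(label, []):
            self.rows_read += 1
            yield vertex


def find_by_label(store: GenericStore, label: str) -> List[Vertex]:
    """Use pushdown when the backend supports it, else scan and filter."""
    if isinstance(store, PushdownStore):
        return list(store.scan_label(label))
    return [v for v in store.scan() if v["label"] == label]


data = [
    {"id": "1", "label": "person"},
    {"id": "2", "label": "person"},
    {"id": "3", "label": "person"},
    {"id": "4", "label": "product"},
    {"id": "5", "label": "product"},
]
generic, pushdown = GenericStore(data), PushdownStore(data)
generic_result = find_by_label(generic, "product")
pushdown_result = find_by_label(pushdown, "product")
print(generic.rows_read, pushdown.rows_read)
```

Both calls return the same two product vertices, but the generic path reads all five rows while the pushdown path reads only the two that match. On a distributed cluster that difference would also mean less data moving across the network.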
In particular, the Aurelius folks could leverage existing Cassandra strengths in the areas of horizontal scalability, data replication, and data sharing among nodes in a cluster. And since Cassandra already had search capabilities via its hooks into Apache Solr, and it already had advanced analytics by way of its Apache Spark integration, there would be no need to include those capabilities.
“We know we have search, we know we have analytics, we know we’re on top of Cassandra,” Van Ryswyk says. “How would I end up doing things differently given those constraints are gone? It just ended up being a complete rewrite, and for that reason, we haven’t committed any of it back to Titan open source, because it’s not possible. It’s just completely different.”
Popular Graph Workloads
DataStax is still benchmarking the new graph database, and should eventually have figures to back up the performance claims. In the meantime, the company is evangelizing the value of graph databases for solving some of its customers’ toughest data challenges.
Graph databases are inherently good at executing certain types of workloads, in particular responding to queries that require finding the links among entities, such as people and products. In some cases, the graph model mimics the natural order of things, which makes it a much more elegant way to tackle some problems.
This is particularly true for workloads that would otherwise require huge, complex SQL joins. “There are places where a graph model is much more natural and performant,” Van Ryswyk says. “As you start getting into more and more complex scenarios, suddenly relational just starts looking insane, and you see the beauty of a graph database.”
Those modeling advantages of graph map particularly well to three specific workloads that DataStax expects to be popular with customers: Customer 360, product recommendation, and fraud detection.
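As a rough illustration of why link-following queries fit a graph model, the sketch below implements a toy product recommendation in Python. It is an assumption-laden example, not DataStax Enterprise Graph code: the `recommend` function, the `friends` and `bought` adjacency maps, and the sample data are all hypothetical. The query walks from a user to their friends and then to the products those friends bought. In a relational schema, each of those hops would typically be a join.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(
    user: str,
    friends: Dict[str, Set[str]],
    bought: Dict[str, Set[str]],
    limit: int = 3,
) -> List[str]:
    """Rank products bought by a user's friends that the user does not own."""
    owned = bought.get(user, set())
    counts: Counter = Counter()
    # Hop 1: user -> friends. Hop 2: friend -> products.
    for friend in friends.get(user, set()):
        for product in bought.get(friend, set()):
            if product not in owned:
                counts[product] += 1
    # Most-bought first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked][:limit]


friends = {"ann": {"bob", "cat"}}
bought = {
    "ann": {"tent"},
    "bob": {"tent", "stove"},
    "cat": {"stove", "lamp"},
}
suggestions = recommend("ann", friends, bought)
print(suggestions)  # ['stove', 'lamp']
```

Each hop is a direct lookup on the current vertex's edges, so extending the query to friends of friends adds one more loop rather than another table join.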
DataStax doesn’t expect its customers to be doing much exploratory analysis using the graph database. The use case is predominantly building a real-time application. But even though the focus is on supporting transactional systems (i.e. OLTP), that doesn’t mean there won’t be some analytic (i.e. OLAP) elements at play.
“The end goal is still a real-time application,” Van Ryswyk says. “But to develop that, you still have to do ad hoc analysis and OLAP graph stuff to help you build your model and figure out what it is you want to find and what’s the best way to do it.”
Dev Tools and Neo Too
DataStax Enterprise Graph ships June 28 as part of DataStax Enterprise (DSE) version 5, which also includes other new features: DSE Advanced Replication (which ensures data synchronicity among geographically disparate nodes in a cluster); Advanced Server Automation (which simplifies the deployment of DSE, particularly on large, virtualized machines); and updates to the Apache Solr and Apache Spark connections.
In other news, it was announced yesterday that Apache TinkerPop has become a top-level project at the Apache Software Foundation. TinkerPop, of course, is the open source project behind the language used to program graph databases like TitanDB and DataStax Enterprise Graph.
DataStax donated TinkerPop to the open source community after acquiring Aurelius, which developed TinkerPop. The Santa Clara-based company decided that it made good business sense to put the development tools into the open source realm while keeping the graph database in its commercial offering.
“Graph is exciting and growing and it’s one of the fastest growing database models,” Van Ryswyk says. “[But] it’s still nascent. We didn’t think a whole language war of different vendors trying to push their own proprietary language to lock people in was a good thing, so we’ve been working with competitors and partners and the ASF to promote TinkerPop as the neutral language that everyone can build on, then we’ll work on building the best mousetrap in the background to actually execute the queries.”
It’s worth mentioning that Neo Technology, the company behind the most popular graph database and the company that DataStax will be competing most heavily against for graph database customers, has also released its development language, called Cypher, into the open source community.
DataStax hopes to take market share from Neo with the message that distributed, scale-out graph databases are the way to go.
“The market will play out,” Van Ryswyk says. “They’re a scale-up architecture and we are bringing to market a scale-out architecture. We have competed with MongoDB in the past where they have a scale-up [architecture] and we have scale-out [architecture]. MongoDB is not going anywhere, but there are people who understand very clearly where to use MongoDB and will use DSE because of its horizontal scaling. I think the same thing will happen in graph.”