The Motivation for Native Graph Databases
Building a dependable database management system is a difficult task. We must be aware of the design trade-offs in the construction of a database management system and understand how those trade-offs impact end-user problems that we want to help solve.
Each database management system draws differently, knowingly or otherwise, from the wide array of design choices. As such, not all databases are created equal. Knowing which database is appropriate for your needs is a function of understanding your application’s goals and requirements balanced against the trade-offs the database designer has chosen.
From the rate of growth in the graph database category, it’s becoming clear that most organizations want to take advantage of connections within their data and are exploring multiple means of doing so. Those enterprises that want to build behavior and decision-making applications based on a live, real-time evaluation of connected data will look for integrity, performance, efficiency and scalability as key attributes in selecting a graph database.
So which database is best for your solution? Fortunately, we can distill prior experience into criteria that help guide us towards a good technology choice. Primary amongst these is whether the database management system has a native or non-native design.
As the name suggests, native graph databases are those specifically built to handle graph workloads across the entire computing stack. Their opposite, non-native graph databases, come in two flavors: those that layer a graph API on top of an existing database management system that is native to some other data model, and those that claim multi-model semantics, where one engine purports to support several kinds of data model.
We observe a considerable difference between the architecture of native graph storage and querying and that of non-native systems. Unsurprisingly, native technologies tend to perform queries faster, scale better (retaining their hallmark query speed as the dataset grows in size), and run far more efficiently on less hardware. Conversely, we observe that non-native stores either optimize for their primary workload at the expense of graphs, or grapple with the complexities of multiple first-class models, often falling short on each.
The Native Graph Database Advantage
A native graph database is distinguished by an exclusive preference to serve graph workloads across its entire stack. From query language through to the database management engine and file system considerations, and from clustering to backup and monitoring, the native graph database epitomizes graph thinking.
In Neo4j, we follow this mantra: all of Neo4j’s software components are continuously graph-affined. As hardware trends emerge and evolve, it is our job to make sure that we map graph workloads onto that hardware efficiently and safely. It is our job to make sure our end-user application developers can productively and humanely work with the graph. It is our job to make sure that your precious data is safe and that the system as a whole is dependable. We are able to do this because we can optimize every layer of our stack for graphs – no responsibility is abdicated to non-graph native software.
Native Graph Storage
Graph storage refers to the underlying structure of connected data persisted (often, but not always) on stable disk. When built specifically for storing graph data, it is known as native graph storage.
Neo4j is designed to use the file system in a way that is expressly sympathetic towards graphs, and so is both highly performant and safe for graph workloads. For example, a traversal across a relationship in Neo4j has constant cost irrespective of the size of the graph, and that constant cost is tiny because of mechanical sympathy between the software and hardware.
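To make the constant-cost claim concrete, here is a minimal in-memory sketch of the idea that each record points directly at its neighbours. It is an illustration only, not Neo4j's on-disk record format; the names `Node`, `Relationship`, `connect` and `neighbours` are hypothetical and chosen for this example.

```python
class Node:
    """A node record that points directly at its own relationship records."""

    def __init__(self, name):
        self.name = name
        self.outgoing = []  # relationships that start at this node
        self.incoming = []  # relationships that end at this node


class Relationship:
    """A relationship record that points directly at both of its end nodes."""

    def __init__(self, rel_type, start, end):
        self.rel_type = rel_type
        self.start = start
        self.end = end


def connect(start, rel_type, end):
    """Create one relationship and register it on both of its end nodes."""
    rel = Relationship(rel_type, start, end)
    start.outgoing.append(rel)
    end.incoming.append(rel)
    return rel


def neighbours(node, rel_type):
    """One hop: follow the node's own pointers, with no global index lookup."""
    return [rel.end.name for rel in node.outgoing if rel.rel_type == rel_type]


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
connect(alice, "KNOWS", bob)
connect(alice, "KNOWS", carol)
connect(bob, "KNOWS", carol)

# Unrelated nodes make the graph bigger but add no work to alice's hop.
strangers = [Node(f"stranger-{i}") for i in range(1000)]

print(neighbours(alice, "KNOWS"))  # ['bob', 'carol']
```

In this sketch the work for one hop depends only on how many relationships the starting node has, not on how many nodes exist in total, which is the property the paragraph above describes.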
Conversely, graph storage is non-native when it is optimized for an alternate storage model, such as relational, columnar, document, or simple key-value storage. These structures are not optimized for graph storage, instead being optimized for their native model.
To reify columnar, relational, document, or key-value data as a graph, the database management system has to perform costly translations to and from the primary model of the database. While implementers can try to amortize the cost of such translations through radical denormalization, this non-native approach typically leads to high latency when querying graphs. It also has very well-understood safety risks when persisting graph data, risks which radical denormalization amplifies.
We are sympathetic to the operations teams who might currently be more familiar with a non-graph backend. But the disconnect between graph data and non-graph storage is problematic for both performance and scalability. Our research indicates that the only way to ensure data safety is to update the graph via ACID transactions. Maintaining relationships between records demands far more than weaker-than-ACID consistency models can provide.
Native graph databases like Neo4j include transactional mechanisms to ensure that data safety remains impervious to network blips, server failures, and even contention from competing transactions or scaling decisions. Non-native graph architectures, especially the variants that are built on eventually consistent stores, can (and will eventually) corrupt graph data.
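The reason relationships are so demanding can be shown with a toy example: a relationship touches two records, so a failure between the two writes leaves half an edge behind. The sketch below is a simplified illustration under stated assumptions, not how Neo4j implements transactions. The store layout, `SimulatedCrash`, and the snapshot-and-restore rollback are all hypothetical stand-ins for real transaction machinery.

```python
import copy


class SimulatedCrash(Exception):
    """Stands in for a server failure between two writes."""


def new_store(*names):
    """Hypothetical record store: each node keeps 'out' and 'in' lists."""
    return {name: {"out": [], "in": []} for name in names}


def add_relationship_unsafe(store, start, end, crash_midway=False):
    """Write both halves of a relationship with no atomicity guarantee."""
    store[start]["out"].append(end)   # first half of the relationship
    if crash_midway:
        raise SimulatedCrash()
    store[end]["in"].append(start)    # second half of the relationship


def add_relationship_atomic(store, start, end, crash_midway=False):
    """All-or-nothing write: restore the previous state if anything fails."""
    snapshot = copy.deepcopy(store)
    try:
        add_relationship_unsafe(store, start, end, crash_midway)
    except SimulatedCrash:
        store.clear()
        store.update(snapshot)        # roll back to the pre-write state
        raise


def is_consistent(store):
    """Every outgoing half must have a matching incoming half, and vice versa."""
    forward = {(s, e) for s, rec in store.items() for e in rec["out"]}
    backward = {(s, e) for e, rec in store.items() for s in rec["in"]}
    return forward == backward


for writer in (add_relationship_unsafe, add_relationship_atomic):
    store = new_store("alice", "bob")
    try:
        writer(store, "alice", "bob", crash_midway=True)
    except SimulatedCrash:
        pass
    print(writer.__name__, is_consistent(store))
```

The unsafe writer leaves `alice` pointing at `bob` while `bob` knows nothing of `alice`, so the consistency check fails. The atomic writer leaves the store exactly as it was before the failed write.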
Furthermore, native storage allows Neo4j to adapt its implementation to the evolving hardware architectures of tomorrow. As memory and disk technologies evolve, Neo4j’s implementation evolves to take advantage of them in support of graph workloads. In coming years we fully expect to adapt Neo4j’s native storage model to emergent novel disk storage platforms and memory architectures like non-volatile RAM.
Native Graph Query Processing
Native graph query processing is another key element of graph technology, referring to how a graph database describes, plans, optimizes and executes queries. With a native graph system, every layer of the architecture – from the user’s expression in the Cypher graph query language to the files on disk – is optimized for storing and retrieving graph data.
Through radical denormalization, non-native graph databases may try to avoid mechanical penalties. For example, a non-native store may be optimized for three levels of traversal depth by duplicating and co-locating data, or by creating an increasingly arcane set of indexes for each query. Beyond that depth, traversal performance degrades drastically, whereas the native approach provides consistently high traversal performance at any depth. The upshot is that queries initially seem performant, but then hit a mechanical cliff edge that causes latency to increase rapidly for reasons that seem innocuous to the end user.
We have seen this first hand. Our early implementations of a graph database (back in 2000-2003) were non-native, with a graph API atop a relational database. When our queries involved around three levels of depth or more, they degraded substantially in performance. Worse, something we take for granted in Neo4j, reversing the direction of a traversal, is extremely difficult with non-native graph processing in an RDBMS. To be able to reverse traversal direction, you must either create a costly reverse-lookup index for each use case, or perform a brute-force search through the original index. Neither is performant or maintainable over time.
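The two options for reversing a traversal over a one-directional index can be sketched as follows. This is a minimal illustration assuming a hypothetical forward-only index that maps each source record to its targets; `forward_index`, `sources_brute_force` and `build_reverse_index` are names invented for the example, not part of any real product.

```python
from collections import defaultdict

# Hypothetical forward-only index: source id -> ids it points to.
forward_index = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": [],
}


def sources_brute_force(index, target):
    """Option 1: scan every entry, so the work grows with the whole dataset."""
    return sorted(src for src, targets in index.items() if target in targets)


def build_reverse_index(index):
    """Option 2: a second index that must be rebuilt or kept in sync on writes."""
    reverse = defaultdict(list)
    for src, targets in index.items():
        for dst in targets:
            reverse[dst].append(src)
    return dict(reverse)


print(sources_brute_force(forward_index, "carol"))  # ['alice', 'bob']

reverse_index = build_reverse_index(forward_index)
print(reverse_index["carol"])                       # ['alice', 'bob']
```

Either way, the reverse direction costs extra: a full scan at query time, or additional storage and write-time maintenance for every access pattern that needs it.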
Key Advantages of Native Graph Architecture
A native graph architecture provides many other advantages that make it generally superior to non-native graph technology.
- Minutes-to-Milliseconds Performance – Native graph databases handle connected data queries much faster than non-native graph databases. Even on very modest hardware, native graph databases can easily handle millions of traversals per second between nodes in a graph on a single machine, and many thousands of transactional writes per second.
- Data Integrity for Graphs – Native graph databases that support ACID transactions ensure that once a transaction is complete, its data is consistent and durable, even when the transaction spans multiple servers. The transaction infrastructure lets transactions run concurrently without interfering with each other. Even deadlocking transactions are automatically detected and rolled back. In the event of a fault, no partially written records will exist.
- Efficiency – Unlike non-native graph models, native graph databases can deliver constant-time traversals through index-free adjacency (each node references its neighbours directly rather than through a global index), without complex schema design or query optimization. This intuitive property-graph model eliminates the need to create additional, and often complex, application logic to process connections.
Why Native Versus Non-Native Matters
It’s often convenient to think that non-native graph technology may be “good enough”, particularly if you already have that non-native technology installed for its native use case. But we see that data tends to grow over time, and today’s datasets are more variably structured, interconnected and interrelated than ever before.
We believe that the value is in the connections – a non-native approach hamstrings that value. A native graph database will serve you better over the long-term and won’t require extraordinary hardware investments.
The choice between native and non-native graph technology is not always clear, but we believe that enterprises hoping to get the most out of the connections in their data will find the integrity, performance, efficiency and scaling advantages of a native graph database crucial for long-term success.
About the author: Jim Webber is Chief Scientist at Neo Technology working on next-generation solutions for massively scaling graph data. Prior to joining Neo Technology, Jim was a Professional Services Director with ThoughtWorks where he worked on large-scale computing systems in finance and telecoms. Jim has a Ph.D. in Computing Science from Newcastle University, UK.