The Polyglot Problem: Solving the Paradox of the ‘Right’ Database
Polyglot Persistence, first introduced in 2011, is now almost canon law in IT. So, what I have to say may come across as blasphemy.
Most organizations today have adopted the concept of Polyglot Persistence — deploying a variety of different data storage technologies based upon the variety of ways that they query, analyze and deploy their data. And the growth and pervasiveness of key/value, graph, time series, and JSON data stores has provided developers with abundant choices to pick the right data tool for the right job.
But the demands on us as developers have rapidly evolved and I believe Polyglot Persistence will quickly struggle to keep pace. Or, maybe, that’s already the reality you’re facing today. Why? Because the task we have is daunting and development teams are already stretched thin.
We need to be experts at a broad spectrum of database technologies, each with its own query language. We need our applications to deliver efficient results across a growing number of use cases. And we’re asked to build and manage hundreds, growing to thousands, of diverse data-driven applications which require a variety of data models and deployments. Can you imagine mastering and maintaining yet another technology or query language? How about five more? Ten more?
Software is eating the world, and our development efforts are only going to grow. The time and cost of integrating multiple datastores are untenable, and developers desperately need to focus on a finite set of database technologies.
More importantly, use cases are growing in sophistication, and what we will ask of our data will require the use of multiple data models, maybe even simultaneously. Building workarounds into the application code to compute query results is neither sustainable nor scalable. I believe it’s time organizations seek hybrid approaches that allow them to both simplify their technology portfolio and have the flexibility to choose the right data model for the right job.
Exploring Data Models
I’ve found that it’s helpful to understand which data models work well for different uses and how these can be combined.
JSON Document Databases
JSON is very versatile for unstructured and structured data. The recursive nature of JSON allows the embedding of subdocuments and variable length lists. Additionally, you can even store the rows of a table as JSON documents, and modern data stores are so good at compressing data that there is essentially no memory overhead in comparison to relational databases. For structured data, schema validation can be implemented as needed using an extensible HTTP API.
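To make the document model concrete, here is a minimal Python sketch of a table row stored as a JSON document and then extended with an embedded subdocument and a variable-length list. All field names and values are hypothetical and chosen only for illustration.

```python
import json

# A relational-style row stored as a flat JSON document (hypothetical fields).
row_as_document = {"part_id": 1017, "name": "fuel pump", "serial": "FP-0042"}

# The same record extended with nested structure that a flat row cannot hold.
part_document = {
    **row_as_document,
    "manufacturer": {"name": "Acme Aero", "country": "DE"},  # embedded subdocument
    "maintenance_dates": ["2023-01-15", "2023-07-15"],       # variable-length list
}

# Round-trip through JSON text, as a document store would on write and read.
encoded = json.dumps(part_document)
decoded = json.loads(encoded)
print(decoded["manufacturer"]["name"], len(decoded["maintenance_dates"]))
```

The flat row and the nested document live side by side in the same format, which is what makes JSON usable for both structured and less structured data.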
Graph Databases
Graphs are a good data model for relations. In many real-world cases, a graph is a very natural data model. It captures relations and can hold label information with each edge and with each vertex. JSON documents are a natural fit to store this type of vertex and edge data.
A graph database is particularly good for “graphy” queries. The crucial thing here is that the query language must implement routines like “shortest path” and “graph traversal”. The fundamental capability behind these is rapid access to the list of all outgoing or incoming edges of a vertex.
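The role of fast edge access can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular database implements it: edges are plain documents with hypothetical "from" and "to" fields, an in-memory index maps each vertex to its outgoing edges, and a breadth-first search uses that index to find a shortest path by number of edges.

```python
from collections import defaultdict, deque

# Edge documents with hypothetical "from"/"to" fields and a label.
edges = [
    {"from": "aircraft", "to": "wing", "label": "contains"},
    {"from": "aircraft", "to": "engine", "label": "contains"},
    {"from": "wing", "to": "flap", "label": "contains"},
    {"from": "engine", "to": "turbine", "label": "contains"},
    {"from": "turbine", "to": "blade", "label": "contains"},
]

def build_outgoing_index(edge_documents):
    """Map each vertex to the list of its outgoing edges."""
    index = defaultdict(list)
    for edge in edge_documents:
        index[edge["from"]].append(edge)
    return index

def shortest_path(outgoing, start, goal):
    """Breadth-first search; returns the vertices on a shortest path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        # The edge index makes this neighbor lookup a single dictionary access.
        for edge in outgoing.get(current, []):
            neighbor = edge["to"]
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None

outgoing = build_outgoing_index(edges)
print(shortest_path(outgoing, "aircraft", "blade"))
```

Without the index, every step of the search would have to scan the full edge list, which is why rapid edge lookup is the capability everything else depends on.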
Multi-Model Databases
A multi-model database combines the capabilities of document, key/value, and graph databases, so you can choose different data models within a single database engine. That alleviates some of the challenges of using different data models at the same time: there is less operational overhead and less data synchronization, which allows for a huge leap in data modeling flexibility.
You suddenly have the option to keep related data together in the same data store, even if it needs different data models. Being able to mix the different data models within a single query increases the options for application design and performance optimizations. And if you choose to split the persistence layer into several different database instances (even if they use the same data model), you still have the benefit of only having to deploy a single technology. Furthermore, you avoid data model lock-in.
The Polyglot Solution
Polyglot Persistence was accepted because it allowed us to avoid compromising on one monolithic database technology. We understood that Polyglot Persistence came at a cost in complexity, a cost to performance, and a cost to consistency and availability as clusters grew larger and larger. But most, if not all, of us felt the benefits of data model flexibility far outweighed those costs, because no single monolithic database has existed, or ever will exist, that addresses all or even most requirements and use cases.
Today, it’s clear that we need the benefits of Polyglot Persistence without the cost. We need to have the flexibility to build high performance applications that scale horizontally and utilize a variety of data models. We need query languages that allow us to query natively and across different data models. We need databases to give us the freedom and flexibility to use different data models in unique ways as our projects inevitably evolve.
For many advanced organizations, it is already common to use a small graph database in one part of a project, a large key/value deployment in another, and a combination of graph, key/value, and document (JSON) models in a third.
Consider a complex data set like that for an aircraft fleet consisting of several aircraft, each consisting of several million parts, which form subcomponents, then larger and smaller components, all of which fall into a hierarchy of “items.”
For optimal fleet maintenance, the organization has to store a variety of data at different levels of the hierarchy, e.g. part or component names, serial numbers, manufacturer information, maintenance intervals, maintenance dates, information about subcontractors, links to manuals and documentation, contact persons, warranty and service contract information, etc. That kind of data hierarchy is a clear natural fit for a graph database because it captures relations between different data points including information on each edge and vertex. But is a graph database ideal to efficiently answer the key queries from the fleet maintenance team?
A graph database performs well for the question: “What are all the parts in a given component?” And it might be able to help answer the question: “Given broken part A, what is the smallest component of the aircraft that contains the part and for which there is a maintenance procedure?” But a graph database would be somewhat useless with a common question such as “Which parts of this aircraft need maintenance next week?” because the graph structure doesn’t fit the query.
However, if that graph data could be stored as JSON documents, associating arbitrary data with vertices and edges, that question could be easily answered through a document query.
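As a rough sketch of such a document query, assume each vertex of the fleet graph is stored as a document carrying a hypothetical next_maintenance date. The question about next week then becomes a plain filter over the documents, with no graph traversal involved. The keys, field names, and dates below are made up for illustration.

```python
from datetime import date, timedelta

# Vertex documents of the fleet graph (hypothetical keys and fields).
items = [
    {"key": "engine-1", "kind": "component", "next_maintenance": date(2024, 3, 20)},
    {"key": "pump-7", "kind": "part", "next_maintenance": date(2024, 3, 5)},
    {"key": "blade-3", "kind": "part", "next_maintenance": date(2024, 3, 8)},
    {"key": "flap-2", "kind": "part", "next_maintenance": date(2024, 4, 1)},
]

def parts_due_between(documents, start, end):
    """Return the keys of part documents due for maintenance in [start, end]."""
    return sorted(
        doc["key"]
        for doc in documents
        if doc["kind"] == "part" and start <= doc["next_maintenance"] <= end
    )

week_start = date(2024, 3, 4)
due = parts_due_between(items, week_start, week_start + timedelta(days=6))
print(due)
```

This sketch scans every document. A real document store would typically answer the same filter from an index on the date field instead.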
The point is that to get all of the queries in that system done fast, you need a database that can store information as a variety of data models, often called a multi-model database. Wouldn’t it be nice if that graph database could implement secondary indexes on its vertex data?
But then it would essentially become a multi-model database. That’s a good first step. The ideal scenario is using graph, document, and key/value data models all at the same time: first find the parts with maintenance due, then run the above shortest path computation for each of them, and finally perform a join operation with the contacts collection to add concrete contact information. Accessing a different data model should just mean changing a query, not your database. That’s where we need to go, and soon.
This fleet maintenance example is not unique or even special. In talking with developers, I’ve found that it is simply a good analog for the growing number and diversity of use cases developers are seeing. From my perspective, the fundamental lesson of Polyglot Persistence is the need to use the right data model for the right job. And with innovations in database technology, we can have more than one in the same database engine. Otherwise, we need to acknowledge that Polyglot Persistence is its own compromise, one that limits us and will eventually put us behind our competitors.
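The three steps of that combined query can be sketched in plain Python over in-memory data. This is only an illustration of the logic, not a multi-model query language. It assumes the item hierarchy is a tree, so walking the hypothetical part_of edges upward stands in for the shortest path computation. All keys, fields, and contact names are invented for the example.

```python
from datetime import date

# Document model: vertex documents keyed by item (hypothetical fields).
items = {
    "aircraft-1": {"procedure": None},
    "engine-1": {"procedure": "EP-100", "contact": "c1"},
    "turbine-1": {"procedure": None},
    "blade-3": {"procedure": None, "next_maintenance": date(2024, 3, 8)},
    "wing-1": {"procedure": "WP-200", "contact": "c2"},
    "flap-2": {"procedure": None, "next_maintenance": date(2024, 3, 6)},
}
# Graph model: child -> parent edges of the item hierarchy.
part_of = {"engine-1": "aircraft-1", "turbine-1": "engine-1",
           "blade-3": "turbine-1", "wing-1": "aircraft-1", "flap-2": "wing-1"}
# Key/value model: contact lookup by key.
contacts = {"c1": {"name": "Engine shop"}, "c2": {"name": "Airframe shop"}}

def smallest_maintainable_component(key):
    """Walk up the hierarchy to the nearest ancestor with a maintenance procedure."""
    current = part_of.get(key)
    while current is not None:
        if items[current].get("procedure"):
            return current
        current = part_of.get(current)
    return None

def maintenance_plan(start, end):
    """Filter by due date, traverse upward, then join with contacts."""
    plan = []
    for key, doc in sorted(items.items()):
        due = doc.get("next_maintenance")
        if due is None or not (start <= due <= end):
            continue  # document query: keep only items due in the window
        component = smallest_maintainable_component(key)  # graph step
        contact_key = items[component].get("contact") if component else None
        contact = contacts.get(contact_key)  # key/value join
        plan.append({"part": key, "component": component,
                     "contact": contact["name"] if contact else None})
    return plan

plan = maintenance_plan(date(2024, 3, 4), date(2024, 3, 10))
print(plan)
```

In a polyglot setup, each of the three steps would hit a different system and the glue code above would live in the application. The argument here is that a single engine should be able to express all three in one query.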
About the author: Max Neunhoeffer is senior developer and architect at ArangoDB. In his academic career, he worked for 16 years on the development and implementation of new algorithms in computer algebra. Several years ago, he shifted his focus to NoSQL databases. At ArangoDB, he is responsible for “all things distributed”, including deployment on Kubernetes, but also resilience, failover, and scalability. His particular interests include distributed transactions, self-healing distributed systems and performance tuning. If his days had 48 hours, he would play golf, go sailing, play the piano and invent a new programming language.