Five Signs Your Cache-Based Database Architecture May Be Obsolete
The digital economy runs on business moments: critical fractions of a second in which lightning-fast chain reactions transform data into insight and turn opportunities into business value. As data has increased in both velocity and volume, the common response has been to add more cache.
But a cache-based database architecture was never designed to handle the volume and latency requirements of today’s systems of engagement. At high volumes, cache becomes unaffordable, untrustworthy, and unstable.
Instead of more memory or a better cache, a better data architecture is needed. To achieve instantaneous decision-making, digital enterprises require a new hybrid memory architecture that processes transactional and analytical data together — in real time — to unlock those business moments.
According to a report from Forrester, “Hybrid memory architecture is a new approach that leverages both volatile memory (DRAM) and nonvolatile memory such as SSD and flash to deliver consistent, trusted, reliable, and low-latency access to support existing and new generations of transactional, operational, and analytical applications. Implementing this type of architecture allows organizations to move from a two-tier in-memory architecture to a single-tier structure that simplifies the movement and storage of data without requiring a caching layer. Early adopters are seeing several benefits to their business like a lower cost of ownership, huge reductions in their server footprint, simplified administration, and improved scalability.”
Here are five signs that your cache-based database architecture may be obsolete and it could be time to make the leap to a hybrid memory architecture:
Your Caching Nodes Are Growing Uncontrollably
As your business grows, so do your data and the size of your cache, in direct proportion. As the value of engagement increases, new applications and projects clamor for database access, increasing transaction volumes and cache working-set sizes. Server counts need to go up even though budgets were set at the beginning of the year. When growth exceeds expectations, it must be addressed, or your system of engagement won’t be able to keep up.
Funding this growth is unsustainable. Business growth means non-linear growth in data, as more information is compiled or queried per customer and per transaction. Add to that the use of additional data sources for better analysis, and your cache will rapidly expand to consume the available budget, and then some. You can move some cached data to SSD and lower costs in the near term, but this adds complexity to managing that data.
Repopulating Your Cache Takes Hours or Even Days
Disruptions are a fact of life. The larger your data center or cloud, the more frequent the disruptions. In the cloud and DevOps era, provisioning a new caching server is a matter of minutes for most companies. But that speed doesn’t extend to the data in your caching layer. A new cache node has to be “rehydrated” to the point where its hit rate is acceptable before it can do its job of reducing database load. For most data-heavy companies this process can take hours or even days, forcing them to deal with limited performance, inaccurate data, the unnecessary cost of even greater over-provisioning, and added application complexity.
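To see why a cold cache hurts, it helps to put numbers on the relationship between hit rate and database load. The following is a minimal sketch, not a measurement: the function name database_load and the figure of 100,000 requests per second are assumptions chosen for illustration, and it assumes every cache miss results in exactly one database read.

```python
def database_load(requests_per_second: float, hit_rate: float) -> float:
    """Return the requests per second that miss the cache and reach the database."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requests_per_second * (1.0 - hit_rate)


# Hypothetical workload: 100,000 requests per second against the caching layer.
REQUESTS_PER_SECOND = 100_000

# Hit rates a replacement cache node might pass through while it rehydrates.
for hit_rate in (0.0, 0.50, 0.90, 0.99):
    load = database_load(REQUESTS_PER_SECOND, hit_rate)
    print(f"hit rate {hit_rate:.0%}: about {load:,.0f} requests/s reach the database")
```

Under these assumptions, a database sized for the traffic that leaks past a 99 percent hit rate sees roughly a hundred times that traffic while the cache is completely cold, which is why rehydration time matters so much.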
Let’s take a look at a real-world example. The sixth-largest brokerage firm in the world was running its intraday trading system on a traditional cache-based relational database management system (RDBMS) architecture. At its daily transaction volume, both the cache and the database were failing during the day. The firm could not perform accurate risk calculations more than once a day (overnight) for fear of overloading the cache-based system. As a result, it was flying blind during the trading period when making financial decisions such as margin loans. With a hybrid memory system, risk metrics can be reevaluated every few minutes to support better business decisions.
Another issue arises when customers trade multiple times throughout the day, because a customer’s position (the amount of stock and money in the account) must be accurate at all times. What happens if the standby cache has stale data and the customer can’t make an additional trade because the position shows insufficient or no funds? Eventually, the customer’s position will be correct, but how long will that take and at what cost? At best, the customer has a scary experience while refreshing the screen a few times. At worst, the brokerage could be held liable for missed or inaccurate trades.
You Still Can’t Meet Your SLAs with Cache-First Architecture
When you’re the fraud detection system for a major online payments system, not meeting service level agreements (SLAs) can mean millions of dollars in lost or fraudulent transactions per day. Neither an RDBMS nor a first-generation NoSQL database is going to be fast enough on its own to meet submillisecond response times, so you have to put a cache in front of it. But it’s rarely as simple as that, and this architecture does nothing to guarantee meeting SLAs in the face of growth.
Take the case of one of the largest payment processing firms in the world. It needed to plan for 10x growth as the payments landscape was undergoing rapid change. Scaling the architecture to meet such an increased load meant, at best, going from 300 to 3,000 servers. At worst, it meant scaling up to 10,000 servers. Switching to a hybrid memory architecture had dramatic results for the company. The server count went from 300 to 20. The switch eliminated the caching layer and the dual database clusters, saving millions of dollars in operating costs. And the response time of the fraud algorithm went from 175 milliseconds to less than 80 milliseconds at peak loads. Transactions continue to be processed even if fraud detection does not return in time, so the previous algorithm’s missed SLAs had resulted in fraud exposure on transaction amounts of up to several million dollars a day, or over a billion dollars a year.
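The arithmetic behind these figures can be checked in a few lines. This is only a back-of-the-envelope sketch: the server counts and growth factor come from the example above, while the 3 million dollar daily exposure is an assumed stand-in for “several million,” not a figure reported by the firm.

```python
# Figures quoted in the example above.
servers_before = 300
servers_after = 20
growth_factor = 10

# Best-case scale-out of the cache-based architecture under 10x growth.
projected_servers = servers_before * growth_factor

# Share of the original servers removed by the switch.
reduction = 1 - servers_after / servers_before

# Assumption: "several million dollars a day" is taken here as 3 million.
assumed_daily_exposure = 3_000_000
annual_exposure = assumed_daily_exposure * 365

# Smallest daily exposure that still adds up to a billion dollars a year.
daily_threshold = 1_000_000_000 / 365

print(f"projected servers at 10x growth: {projected_servers:,}")
print(f"server reduction after the switch: {reduction:.1%}")
print(f"annual exposure at the assumed daily figure: ${annual_exposure:,}")
print(f"daily exposure needed to reach $1 billion a year: ${daily_threshold:,.0f}")
```

The last line shows that any daily exposure a little under 3 million dollars is already enough to cross a billion dollars over a full year, which is consistent with the claim in the text.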
Your Data Is Large Enough That You Need Clustering
Your business is finally starting to grow, and your cache has morphed into a distributed in-memory database that brings the added burden of sharding, clustering, and other new techniques. Personnel with these skill sets are hard to find.
You may already be using sharding inside your application to create more capacity — it is the best practice, after all. However, with enough growth, sharding may no longer be enough, which leads us to cluster management. Clustering mechanisms are typically more advanced than sharding, with a new set of limitations and trade-offs that must be understood; for instance, some commands may not be supported in a clustering environment, multiple databases may not be supported, programmers need to know which node serves which subset of keys, and key lookup across clusters can become an issue.
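To make concrete what it means for the application to know which node serves which subset of keys, here is a minimal sketch of client-side sharding. It assumes the simplest possible scheme, a hash of the key taken modulo the number of cache nodes; the names shard_for and moved_fraction and the user:N key format are hypothetical, and real caching products may use different placement schemes.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (assumed modulo scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


# Hypothetical key set: 10,000 user records.
keys = [f"user:{i}" for i in range(10_000)]

# The application must compute this mapping on every cache access.
print("user:42 lives on shard", shard_for("user:42", 4))

# Growing from 4 to 5 cache nodes changes the mapping for most keys.
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys map to a different shard after adding a node")
```

With this naive scheme, a key keeps its shard only when its hash gives the same remainder for both shard counts, so about four in five keys are expected to move when a fifth node is added. Each moved key is a cache miss until it is reloaded. Limitations of this kind are what push teams from hand-rolled sharding toward cluster management, with the trade-offs described above.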
Cache Stampedes Happen Often
A cache stampede is a type of cascading failure that can be triggered by the random failure of a single node. It can be caused by dependence on poor clustering or sharding algorithms that leave the remaining nodes with an unbalanced load.
From your users’ standpoint, a cache stampede means that the item they want to see or buy won’t load, and will eventually time out. Users, in their impatience, will either abandon their request or try to refresh the page, exacerbating the problem. In either case, the results can be disastrous for your reputation or revenue stream.
There are three methods for dealing with cache stampede: locking, external recomputation, and probabilistic early expiration. All involve code changes at the application level, which then must be “sold” to development groups for them to integrate into their code to prevent a recurrence. At the end of the day, none of these methods solves the problem.
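To illustrate the kind of application-level change involved, here is a minimal sketch of the first method, locking. It is an in-process illustration only: the LockingCache class and the slow_database_lookup function are hypothetical, entries never expire, and a real deployment with an external cache would need a lock shared across servers rather than a thread lock.

```python
import threading
import time


class LockingCache:
    """In-process cache that lets only one caller recompute a missing key."""

    def __init__(self, loader):
        self._loader = loader            # function that reads from the database
        self._values = {}                # cached entries
        self._key_locks = {}             # one lock per key
        self._guard = threading.Lock()   # protects the _key_locks dictionary

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._values:          # fast path: already cached
            return self._values[key]
        with self._lock_for(key):
            # Re-check: another thread may have filled the entry while we waited.
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]


database_calls = []


def slow_database_lookup(key):
    """Hypothetical stand-in for an expensive database read."""
    time.sleep(0.05)
    database_calls.append(key)
    return key.upper()


cache = LockingCache(slow_database_lookup)
results = []

# Eight concurrent requests for the same uncached item.
threads = [
    threading.Thread(target=lambda: results.append(cache.get("item-42")))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(database_calls), "database call(s) served", len(results), "requests")
```

Only one of the eight requests reaches the database, and the rest wait for its result. The waiting is the trade-off: the lock protects the database, but every caller still pays for the slow lookup, and the locking logic now lives in application code.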
So why should your application developers bear the burden of cache and database administration in the first place? It creates needless complexity, impacts quality and time to market, and places the customer experience at greater risk. Why not drop the caching layer altogether and rely on a distributed database that manages all these problems, without requiring the application to get involved?
Reevaluate the Need for External Caching
The above issues of server growth, architectural complexity, instability, and cache stampede indicate that an external caching layer is not always the best strategy for success, especially for systems with spiky or heavy, continuously growing data loads.
The days of the external caching layer as the de facto architecture for performance and scale are long gone. The growth in data and the continued downward pressure on response times have rendered this technology obsolete for many use cases. It’s time to challenge current thinking about best practices and accepted architectures. A hybrid memory architecture is key to current and future success for businesses going through digital transformation.
About the author: Srini Srinivasan is chief product officer and founder at Aerospike, a developer of an enterprise-grade, non-relational database. He has two decades of experience designing, developing, and operating high-scale infrastructures. He also has more than 30 patents in database, web, mobile, and distributed systems technologies. He co-founded Aerospike to solve the scaling problems he experienced with internet and mobile systems while he was senior director of engineering at Yahoo.