Language Flags

Translation Disclaimer

HPCwire Enterprise Tech HPCwire Japan

May 29, 2012

The Evolution of Data Management

The rate of hardware innovation has vastly outpaced that of software - and database systems in particular. The fundamental characteristics of hardware have been revolutionized yet database architecture has not and persists with structures that date back to a bygone era.

There have been numerous database innovations, but they have tinkered at the edges rather than solve the fundamental problems. We have seen a plethora of band aid architectures where features of the database are designed to alleviate specific performance problems rather than resolve them.

Indeed, the industry has largely focused on scaling hardware to overcome the performance deficiencies of databases rather resolve the fundamental hardware inefficiency.

One fundamental problem with relational databases is that the way the data is stored – by row or by column – limits how the data can be used. Thus relational databases are either architected for operations or for analytics – but not both at the same time. This creates complexity and cost  when delivering analytics against operational data – especially for real-time or operational analytics.

The traditional relational database is the row store, which dates back to the 1970s. Even the more recent column storage used for analytics is a concept that dates to the 1970’s.

Almost any business measure you examine has radically changed since the relational database was first conceived in the 1970’s.

A database back then was largely operational and was purely responsible for providing a definitive record of the current operational state of the business. Now businesses also need to know how they got to where they are for both analytical and compliance reasons.

A very large database back then was measured in megabytes, which is now a trivial volume by comparison to the terabyte volumes that are commonplace today.

Back in the 1970’s, business intelligence was serviced by monthly or weekly summary reports.  Now we see a need for both real-time and for sophisticated analytics.

Before mobile phones and the internet, transactions were driven by customer care centers with limited human bandwidth. Today many transactions are now submitted through self-service operations or autonomous device notifications and the volumes are enormous by comparison.

Next -- Advances in Hardware >>


Advances in Hardware

Both the schema and the queries submitted were well defined in advance by requirements gathering using conventional waterfall design processes. That’s all changed. It’s a lot more unpredictable these days with businesses constantly optimizing their operations and rapidly responding to new trends or markets.

Back in the 1970’s, the CPU and memory were joined at the hip, such that memory was the cache for the CPU. Back then storage latency was the only performance problem and there was only a “storage wall” to overcome.

Since then, CPU speed and transfer rates have increased a thousand fold while latency in storage and memory has lagged to the point where there is now a “memory wall” to overcome as well. Memory is no longer fast enough for the CPU.

Thankfully, the speed of light remains the same – but this has important implications for data access. Whereas, in 1970, information could travel 300 metres within one CPU tick, that distance has been reduced to 100 millimetres with the increase in CPU clock speed.

If memory or storage sits further than 5cm from the CPU, then the CPU has to stall while waiting to fetch new data from it. Hence CPUs have their own cache. Even if the storage is as fast as static RAM – it will still create a storage wall if it doesn’t sit right on the motherboard alongside the CPU.

Next -- The Rise of Band-Aid Architectures >>


The Rise of Band-Aid Architectures

Unsurprisingly, the last century storage structures that are still used extensively by relational databases today fail to exploit contemporary hardware and use it efficiently.

Meanwhile, the industry has focused on fixing the problem with a band aid architecture.

If indexing is slow then partition the indexes to mitigate the problem.

If loading rates are slow then provide non-transactional bulk load utilities.

If joins are too slow, then  de-normalize the schema to avoid them.

None of these solutions fixes the fundamental inefficiency – it is simply a workaround.

Next -- The Three Deadly Database Flaws >>

The Three Deadly Database Flaws

Databases need the solve three fundamental flaws.

Databases are not general purpose. A row store does operations; while a column store does analytics. A row store indexed for analytics will struggle with operations. Moreover, any database becomes more specialized as more indexes are layered onto it – it becomes more adept at doing one job well and less able to perform other tasks.

Databases need to become general purpose to reduce the cost and complexity that arise when organizations have dozens or hundreds of interconnected’ special-purpose’ databases.

Databases so not understand their data. Put a ton of data into a simple row store and it remains useless until you layer indexes on top of it. Column stores are only able do the most basic of aggregations until additional structures are added. Inevitably, these simplistic row and column storage models require physical design effort to make them useful. To overcome this, databases need to understand their data at a higher semantic level rather than simple physical rows, columns and data types.

Databases heavily rely on hardware scaling to overcome their performance limitations. But scaling-out adds complexity and cost and introduces new problems – such as network latency, consistency between nodes, and network bandwidth problems from distributed joins. Databases need to make more effective use of the power of the hardware and avoid unnecessary scale-out.

A relational database uses a logical schema of tables and columns to precisely reflect the application domain it is designed to serve. These logical structures are very agile – most mature relational databases allow tables and columns to be added, altered or dropped at will and instantaneously. The logical schema is responsive and can easily adapt to an evolving application.

But there are also physical structures such as indexes and partitions. These are the dark underpinnings of the database and create all of the pain. Adding or altering an index or partitioning scheme can consume a huge amount of time and resources on a non trivial database. These structures make the database rigid because they create compromise and cause delays.

Next -- Delivering on the Database Promise >>


Delivering on the Database Promise

An application developer should not need to concern himself with physical database design. Databases need to alleviate the pain of physical design by understanding their data better.

Databases need to align their data structures with the characteristics of contemporary hardware.

While transfer rates are fast, latency remains a big issue for both memory and storage. Data structures need to be designed to amortize latency by minimizing the number of fetch requests made to memory and storage and optimizing the size of data transferred by each request. This means providing good spatial locality whereby the majority of data required for any individual operation is co-located in storage.

Multiple cores with private caches are commonplace and they use an expensive cross-core protocol to maintain consistency between those caches . This cache coherency protocol can limit CPU performance when cores are required to share updates. Hence the data structures used by databases need to allow arbitrary and independent parallel access by multiple cores while requiring  minimal synchronization and communication between CPU cores.

Databases need to separate their storage structure from the data model use by the developer.  The database storage does not need to be hardwired into providing a relational, object or key-value data model view of the world. The data model should just be a convenient view in which a developer chooses to work – meanwhile the database can handle the translation between the developer view of the data and its physical structure.

This would allow multiple models to coexist against the same data and obviate the debate about the best use of relational vs. NoSQL databases.

Scale-out is a necessary strategy to deal with unlimited data volumes and database activity. But while scale-out solves a limited set of performance problems it brings its own challenges, including added latency, bandwidth limitations, consistency issues and cost and complexity. These problems mostly arise from physical constraints and are inevitable. Hence scale-out is best treated as a solution of last resort rather than an option of first choice.

However, in scaled-out environments, transactions need to be able to choose what guarantees they require – rather than enforcing or relaxing ACID constraints across a whole database. Each transaction should be able to decide how synchronous, atomic or durable it needs to be and how it must interact with other transactions. For example, must a transaction be applied in chronological order or can it be allowed out of time order with other transactions providing the cumulative result remains the same? Not all transactions need be rigorously ACID and likewise not all transactions can afford to be non-atomic or potentially inconsistent.  

Hardware will continue to evolve and databases need to follow the trends.

Cores will continue to proliferate and databases need to become inherently parallel within a single server.  This needs to be treated as a shared-nothing scaling problem within a single CPU because unnecessary communication between cores will throttle performance.

SSD has brought storage speeds closer to that of memory. Meanwhile large non-volatile memory is a technology in development and is probably only a few years away from commercialization. The distinction between storage and memory will eventually disappear and that will change the way applications want to interact with a database and databases will need to adapt accordingly.

But even with non-volatile memory - the problem of the memory wall will remain for some time and will continue to govern performance limitations.

Share Options


» Subscribe to our weekly e-newsletter


There are 0 discussion items posted.


Most Read Features

Most Read News

Most Read This Just In

Cray Supercomputer

Sponsored Whitepapers

Planning Your Dashboard Project

02/01/2014 | iDashboards

Achieve your dashboard initiative goals by paving a path for success. A strategic plan helps you focus on the right key performance indicators and ensures your dashboards are effective. Learn how your organization can excel by planning out your dashboard project with our proven step-by-step process. This informational whitepaper will outline the benefits of well-thought dashboards, simplify the dashboard planning process, help avoid implementation challenges, and assist in a establishing a post deployment strategy.

Download this Whitepaper...

Slicing the Big Data Analytics Stack

11/26/2013 | HP, Mellanox, Revolution Analytics, SAS, Teradata

This special report provides an in-depth view into a series of technical tools and capabilities that are powering the next generation of big data analytics. Used properly, these tools provide increased insight, the possibility for new discoveries, and the ability to make quantitative decisions based on actual operational intelligence.

Download this Whitepaper...

View the White Paper Library

Sponsored Multimedia

Webinar: Powering Research with Knowledge Discovery & Data Mining (KDD)

Watch this webinar and learn how to develop “future-proof” advanced computing/storage technology solutions to easily manage large, shared compute resources and very large volumes of data. Focus on the research and the application results, not system and data management.

View Multimedia

Video: Using Eureqa to Uncover Mathematical Patterns Hidden in Your Data

Eureqa is like having an army of scientists working to unravel the fundamental equations hidden deep within your data. Eureqa’s algorithms identify what’s important and what’s not, enabling you to model, predict, and optimize what you care about like never before. Watch the video and learn how Eureqa can help you discover the hidden equations in your data.

View Multimedia

More Multimedia


Job Bank

Datanami Conferences Ad

Featured Events

May 5-11, 2014
Big Data Week Atlanta
Atlanta, GA
United States

May 29-30, 2014
St. Louis, MO
United States

June 10-12, 2014
Big Data Expo
New York, NY
United States

June 18-18, 2014
Women in Advanced Computing Summit (WiAC ’14)
Philadelphia, PA
United States

June 22-26, 2014

» View/Search Events

» Post an Event