November 22, 2017

Backing Up Big Data? Chances Are You’re Doing It Wrong

Peter Smails


The increasing pervasiveness of social networking, multi-cloud applications and Internet of Things (IoT) devices and services continues to drive exponential growth in big data solutions. As businesses become more data driven and larger, more current data sets become important to support the online business processes, analytics, intelligence and decisions. Additionally, data availability and integrity become increasingly critical as more and more businesses and their partners rely on these (near) real-time analytics and insights to drive their business. These big data solutions typically are built upon a new class of hyper-scale, distributed, multi-cloud, data-centric applications.

While these NoSQL, semi-structured, highly distributed data stores are perfect for handling vast amounts of big data on a large number of systems, they can no longer be effectively supported by legacy data management and protection models. A different approach to backup and recovery is needed, not only because of the sheer data size and the vast number of storage and compute nodes, but also because of the built-in data replication, data distribution, and data versioning capabilities. Even though these next-generation data stores have integrated high availability and DR capabilities, events like logical data corruption, application defects, and simple user errors still require another level of recoverability.

To meet the requirements of these high-volume and real-time applications in scale-out, cloud-centric environments, a wave of new data stores and persistence models has emerged. Gone are the days of just files, objects and relational databases. The next-generation key-value stores, XML/JSON document stores, arbitrary-width column stores and graph databases (sometimes characterized as NoSQL stores) share several fundamental characteristics that enable big data-driven IT. Almost without exception, all big data repositories are based on a cloud-enabled, scale-out, distributed data persistence model that leverages commodity infrastructure while providing some form of integrated data replication, multi-cloud distribution and high availability. The big data challenges aren’t limited to data ingest, data storage, data processing, data queries, result set capturing and visualization; they also pose increasing difficulties around data integrity, availability, recoverability, accessibility and mobility/movement. Let’s see how this plays out in a couple of example case studies.


A first case study revolves around an Identity and Access Management service provider that uses Cassandra as its core persistence technology. The IDaaS (Identity as a Service) offering is a multi-tenant service with a mixture of large enterprise, SMB and development customers and partners. The Cassandra database provides them with a highly scalable, distributed, highly available data store that supports per-tenant custom user and group profiles (i.e., dynamic, extensible schemas). While the data set may not be very large in absolute storage size, the number of records will definitely be in the tens, if not hundreds, of millions.

What drives the unique requirements for recoverability is the multi-tenancy and the 100% availability targets of the service. Whether it is through user error, data integration defects and changes, or simply tenant migrations, it may be necessary to recover a single tenant’s data set without having to restore the whole Cassandra cluster (or a replica thereof). Similarly, the likelihood that the complete Cassandra cluster is corrupt is slim, and in order to maintain (close to) 100% availability for most tenant service instances, partial recovery would be required. This drives the need for some level of application-aware protection and recovery. In other words, the protection and recovery solution must establish and persist some semantic knowledge of the application data to be able to recover specific, consistent Cassandra table instances or points in time.
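The tenant-level recovery described above can be sketched in a few lines. This is a minimal illustration only: it uses an in-memory dictionary as a stand-in for a multi-tenant table, and the names `take_snapshot` and `restore_tenant`, the key layout and the sample tenants are all hypothetical. It does not use any Cassandra API and is not a description of how any particular product works.

```python
from copy import deepcopy

# Hypothetical stand-in for a multi-tenant table:
# keys are (tenant_id, user_id), values are profile dictionaries.

def take_snapshot(table):
    """Return an independent point-in-time copy of the whole table."""
    return deepcopy(table)

def restore_tenant(live, snapshot, tenant_id):
    """Build a table in which one tenant's rows come from the snapshot
    and every other tenant's rows are kept exactly as they are in live."""
    restored = {k: v for k, v in live.items() if k[0] != tenant_id}
    restored.update(
        {k: deepcopy(v) for k, v in snapshot.items() if k[0] == tenant_id}
    )
    return restored

live = {
    ("acme", "alice"): {"groups": ["admin"]},
    ("acme", "bob"): {"groups": ["dev"]},
    ("globex", "carol"): {"groups": ["ops"]},
}
snap = take_snapshot(live)

# After the snapshot, a faulty integration damages one tenant's data
# while another tenant keeps making legitimate changes.
del live[("acme", "bob")]
live[("acme", "alice")]["groups"] = []
live[("globex", "carol")]["groups"].append("audit")

recovered = restore_tenant(live, snap, "acme")
```

The point of the sketch is that the restore is scoped by an application-level key (the tenant) rather than by a storage-level unit, so the unaffected tenant keeps its newer changes and never goes offline.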

The second case study is centered around a Hadoop clustered storage solution, whereby the enterprise application set persists its time-series data from devices and their end-user activities in the Hadoop filesystem. The Hadoop storage acts as a de facto “data lake” fed from multiple diverse data sources in different formats, to which the enterprise can apply various forms of data processing and analysis through map-reduce batch processing, real-time analytics, streaming and/or in-memory queries and transformations. Even though a map-reduce job creates ephemeral intermediate and end results that in principle could be recreated by running the job once more in case of failure or corruption, the data set can be too large (and therefore too expensive to reprocess) and may be undergoing constant updates.

Even though Hadoop provides replication and erasure-coded redundancy (for high availability and scale-out), there really is no data versioning or snapshots for that matter (given the original ephemeral model of the map-reduce processing). Any logical error, application or service failure, or plain user error could result in data corruption or data loss. Data loss or corruption could occur to the original ingested data, any intermediate ephemeral data or data streams, as well as any resulting datasets or database instances and tables. Rather than creating a full copy of the Hadoop file system for backup and recovery of intermediate files and database tables (which would be cost prohibitive and/or too time consuming), a different approach is needed. In order to do so, a better understanding of the application data sets and their schemas, semantics, dependencies and versioning is required.

Looking at both case studies, there is a common thread driving the need for a different approach to data management, and specifically to backup and recovery:

  • Both Cassandra and Hadoop provide integrated replication and high-availability support. Neither capability, however, provides sufficient protection, if any, against full or partial data corruption or data loss (human, software or system initiated). An actual application data-centric or application-aware backup is needed to support data recovery of specific files, tables, tenant data, intermediate results and/or versions thereof.
  • However, a storage centric (file or object infrastructure) backup solution is not really feasible. The data set is either too large to repeatedly be copied in full, or a full data set takes too large an infrastructure to recover fully or to extract just specific granular application data items. In addition, storage centric backups (file system snapshot, object copies, volume image clones or otherwise) do not provide any insight into the actual data set or data objects that the application depends on. On top of the fully recovered storage repository, an additional layer of reverse engineered application knowledge would be required as well.
  • Application downtime is critical now more than ever. In both case studies, multiple consuming services or clients depend on the scale-out service and persistence. Whether it’s a true multi-tenant usage pattern or a multitude of diverse data processing and analytics applications, the dataset needs to be available close to 100% of the time. In addition, a full data set recovery would simply take too long, and the end users or clients would incur too much downtime. Only specific, partial data recovery would support the required SLAs.

The requirement for an alternate data management and recovery solution is not limited to the Cassandra and Hadoop case studies described above. Most big data production instances ultimately do require a data protection and recovery solution that supports incremental data backup and specific partial or granular data recovery. More importantly, the data copy and recovery process must acquire semantic knowledge of the application data in order to capture consistent data copies with proper integrity and recoverable granularity. This would allow the big data DevOps and/or production operations teams to recover just the data items that are needed without having to do a full big data set recovery on an alternate infrastructure. For example, the data recovery service must be able to expose the data items in the appropriate format (e.g. Cassandra tables, Hadoop files, Hive tables, etc.) and within a specific application context. At the same time, the protection copies must be distributable across on-premises infrastructure as well as public cloud storage, both to leverage cost-effective protection storage tiering and scaling and to support recovery on alternate cloud infrastructure.
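To make the combination of incremental backup and granular recovery concrete, here is a minimal sketch that assumes a content-addressed store keyed by SHA-256 digest. The function names `incremental_backup` and `restore_item`, the dictionary-based store and the sample items are hypothetical and purely illustrative; a real system would also have to deal with consistency across distributed nodes, which this sketch ignores.

```python
import hashlib

def digest(data: bytes) -> str:
    """Content hash used as the storage key."""
    return hashlib.sha256(data).hexdigest()

def incremental_backup(items, store):
    """Back up named items (name -> bytes) into a content-addressed store.

    Only content not already present in the store is copied.
    Returns (manifest, copied): the manifest maps each item name to its
    digest for this backup run; copied lists the names actually written.
    """
    manifest = {}
    copied = []
    for name, data in items.items():
        key = digest(data)
        manifest[name] = key
        if key not in store:
            store[key] = data
            copied.append(name)
    return manifest, copied

def restore_item(name, manifest, store):
    """Recover a single named item as of the backup run described by manifest."""
    return store[manifest[name]]

store = {}
day1 = {"users": b"alice,bob", "events": b"e1,e2"}
m1, c1 = incremental_backup(day1, store)

day2 = {"users": b"alice,bob", "events": b"e1,e2,e3"}
m2, c2 = incremental_backup(day2, store)

old_events = restore_item("events", m1, store)
new_events = restore_item("events", m2, store)
```

In this sketch the second run copies only the item that changed, and each manifest acts as a lightweight version that lets a single item be recovered as of a given backup without touching anything else.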

A solution that provides big data protection and recovery in a granular and semantic aware approach not only addresses “Big Data Backup” in the appropriate fashion, but it also creates opportunities to extract and use data copies for other purposes. For example, the ability to extract application specific data copies or critical parts of the big data set enables other users to efficiently get down-stream datasets for test and dev, data integrity tests, in-house analytics, 3rd party analytics or potential data market offerings. Combining this with multi-cloud data distribution, we then get closer to realizing a multi-cloud data management solution that starts to address today’s and tomorrow’s needs for application and data mobility, as well as their full monetization potential.

About the author: Peter Smails is vice president of marketing and business development at Datos IO, provider of a cloud-scale, application-centric data management platform that enables organizations to protect, mobilize, and monetize their application data across private cloud, hybrid cloud, and public cloud environments. A Dell EMC veteran, Peter brings a wealth of experience in data storage, data protection and data management.
