July 23, 2020

The Journey to Effective Data Management in HPC

Andy Ferris


Imagine a simple interface for data search across an organization’s local and cloud storage. The search would return relevant data types, their location, and automatically extracted metadata. From there, advanced analytics could be performed in a serverless environment and scale seamlessly to the cloud as needed. Results files would be presented in an interactive, configurable, and shareable format. Large raw data files could be transferred to collaborators in parallel over high-speed, low-latency connections.

While this visionary solution sounds like an incredible way to advance research and take advantage of diverse datasets, such a solution does not exist. When it comes to managing petascale datasets, most organizations don’t know where to start.

The Current State of Data Management in HPC

High Performance Computing (HPC) continues to be a major resource investment for research organizations worldwide. HPC workloads consume and generate large datasets, which makes data management a key component of using the expensive resources that underlie HPC infrastructure effectively. Despite this, many organizations do not have a data management plan in place.

As an example of data generation rates, the total worldwide storage footprint from DNA sequencing alone is estimated to exceed 2 exabytes by 2025, most of which will be processed and stored in HPC environments. This growth puts an immense strain on life science organizations. But life sciences is not the only field stressing HPC infrastructure: research institutions like Lawrence Livermore National Laboratory (LLNL) generate 30 TB of data a day. This data supports research and development efforts in national security, and these daily volumes can also be expected to grow.

As the HPC community continues to generate massive amounts of file data, drawing insights from that data, making it useful, and protecting it become considerable efforts with major implications. A better way to manage storage and data has multiple benefits: insights drawn from disparate data sources, increased researcher productivity, reduced costs, and lower staff maintenance and administration overhead. To operate at the petabyte scale HPC demands, the foundations of data management must scale with the data.

Pharma HPC Data Management Case Study

A pharma company was unable to meet backup service level agreements (SLAs), and analysis was delayed by the performance of its HPC storage infrastructure. Admins had no way of determining how old data was or which directories were the fullest. These issues were impeding research progress. The organization needed a plan to understand its data and manage its storage.

With a data management plan in place, admins took manual intervention out of backup and increased storage protection, capacity, performance, and usability. The company replaced the old primary storage with lower-capacity, higher-performance storage and implemented a lower-performance tier for backup and archive.

Data visibility was implemented in the form of an application that rapidly scanned the storage to determine directory size and age. Free-text global search of archived and backed-up files enabled restore functionality that met SLAs. Workflow integration through the application’s API allowed efficient transfer of datasets from archive storage back to primary storage for reanalysis. Through better data management, research was accelerated.
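To make the “directory size and age” idea concrete, here is a minimal sketch of such a scan in Python. It is illustrative only, not the application described above: it walks a POSIX-mounted filesystem and reports, for each top-level directory, the total size and how long ago anything in it was last modified. The paths and output format are hypothetical; a production tool would scan in parallel and persist results to a searchable index.

    #!/usr/bin/env python3
    """Report total size and last-modified age for each top-level directory.
    Paths and output format are hypothetical; a real scanner would run in
    parallel and store results in an index rather than printing them."""

    import os
    import sys
    import time
    from pathlib import Path

    def scan_directory(root: Path):
        """Return (total_bytes, newest_mtime) for everything under root."""
        total_bytes, newest_mtime = 0, 0.0
        for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
            for name in filenames:
                try:
                    st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                total_bytes += st.st_size
                newest_mtime = max(newest_mtime, st.st_mtime)
        return total_bytes, newest_mtime

    if __name__ == "__main__":
        base = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
        now = time.time()
        for entry in sorted(p for p in base.iterdir() if p.is_dir()):
            size, mtime = scan_directory(entry)
            age_days = (now - mtime) / 86400 if mtime else float("inf")
            print(f"{entry.name:30s} {size / 1e12:9.3f} TB  "
                  f"last modified {age_days:7.1f} days ago")

Run against a storage mount point (for example, a hypothetical /mnt/nas1/projects), this prints one line per project directory, which is often enough to spot the oldest and largest candidates for archiving.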

Designing a Data Management Plan

Taking the first steps to a data management plan can seem daunting, but with a list of requirements, your journey can begin. At a high level, a data management plan should include:

  • Core architecture able to scale to petabytes and even exabytes without loss in performance;
  • Visibility into files across all systems, accessible through a single interface;
  • Global search with metadata tagging and extraction;
  • Archiving functionality to free up space on primary tiers;
  • Backup for data protection;
  • Policies for backup, restore, and retention defined and conveyed to users.


As you evaluate solutions to meet these requirements, several key features need to be available in the application:

  • Vendor agnostic with access to all storage locations through a single interface;
  • Performance that scales with the data volumes and file count;
  • Intelligent management of storage to optimize for cost and performance. This includes backup and archive, to any cloud tier or to local storage;
  • Data analytics through an API or built-in applications, and automatic extraction and assignment of metadata from files;
  • Quick setup time with low resource requirements and simple configuration.

Currently, no such turnkey solution exists. Many organizations, when faced with the challenges of a data management plan, do not have the resources to take even the initial steps. More often than not, none of the challenges are addressed: other tasks take priority over understanding, using, and protecting data, and “doing nothing” wins out over taking the first steps toward data management. Many organizations, for example, have no way to determine the age of files across their NAS footprint. Without this relatively basic level of data visibility, files cannot be archived to a lower-cost tier to free space on high-performance storage.
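As an illustration of that basic level of visibility, the sketch below lists files that have not been modified within a given window across a set of NAS mount points and totals the space they occupy, i.e., the capacity that could be reclaimed by archiving them to a lower-cost tier. The mount points and the one-year threshold are assumptions made for the example, not recommendations from the article.

    #!/usr/bin/env python3
    """List archive candidates: files not modified in the last N days across
    several NAS mount points. Mount points and the threshold are hypothetical;
    real tooling would also consider access times, ownership, and policy."""

    import os
    import time

    MOUNT_POINTS = ["/mnt/nas1/projects", "/mnt/nas2/scratch"]  # hypothetical mounts
    AGE_THRESHOLD_DAYS = 365

    def archive_candidates(roots, age_days):
        cutoff = time.time() - age_days * 86400
        for root in roots:
            for dirpath, _dirs, files in os.walk(root, onerror=lambda e: None):
                for name in files:
                    path = os.path.join(dirpath, name)
                    try:
                        st = os.stat(path, follow_symlinks=False)
                    except OSError:
                        continue  # skip files that vanish mid-scan
                    if st.st_mtime < cutoff:
                        yield path, st.st_size

    if __name__ == "__main__":
        total_bytes = 0
        for path, size in archive_candidates(MOUNT_POINTS, AGE_THRESHOLD_DAYS):
            total_bytes += size
            print(path)
        print(f"Reclaimable from primary storage: {total_bytes / 1e12:.2f} TB")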

However, taking the first steps to data management by implementing a scalable solution will allow you to scale your plan with your data. With this plan in place, you can:

  • Understand, use, and manage your data more effectively through visibility, search, and metadata
  • Protect your data and manage infrastructure through backup and archive

Instead of avoiding the considerable challenges associated with data-driven HPC, begin your journey to effective data management. Building the foundation on a powerful, scalable solution will enable you to meet performance needs now and in the future.

About the author: Andy Ferris is a product manager at Igneous experienced in developing new software products and managing cross-functional teams. Andy has a Bachelor of Science degree in Mechanical Engineering from the University of Washington and a Master of Business Administration (M.B.A.) focused in Entrepreneurship from The University of Texas at Austin. At Igneous, he assists customers with understanding and effectively managing their data at petascale.
