July 23, 2020

The Journey to Effective Data Management in HPC

Andy Ferris


Imagine a simple interface for data search across an organization’s local and cloud storage. The search would return relevant data types, their location, and automatically extracted metadata. From there, advanced analytics could be performed in a serverless environment and scale seamlessly to the cloud as needed. Result files would be presented in an interactive, configurable, and shareable format. Large raw data files could be transferred to collaborators efficiently and in parallel over high-speed, low-latency connections.

While this visionary solution sounds like an incredible way to advance research and take advantage of diverse datasets, such a solution does not exist. When it comes to managing petascale datasets, most organizations don’t know where to start.

The Current State of Data Management in HPC

High Performance Computing (HPC) continues to be a major resource investment for research organizations worldwide. HPC workloads consume and generate large datasets, which makes data management key to using expensive HPC infrastructure effectively. Although data management is this critical, many organizations do not have a data management plan in place.

As an example of data generation rates, the total storage footprint worldwide from DNA sequencing alone is estimated to exceed 2 exabytes by 2025, most of which will be processed and stored in an HPC environment. This growth rate places immense strain on life science organizations. Life sciences are not the only source of big data stressing HPC infrastructure, however: research institutions like Lawrence Livermore National Laboratory (LLNL) also generate 30 TB of data a day. This data supports their research and development efforts applied to national security, and these daily data volumes can also be expected to increase.
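To put the daily figure in perspective, the short calculation below converts a constant 30 TB per day into a yearly volume. It is only a back-of-the-envelope sketch: the constant rate, the 365-day year, and the use of decimal units (1 PB = 1,000 TB) are assumptions made here for illustration, not figures from LLNL.

```python
TB_PER_DAY = 30        # daily figure quoted above
DAYS_PER_YEAR = 365    # assumption: non-leap year, constant daily rate
TB_PER_PB = 1000       # assumption: decimal (SI) units


def annual_volume_pb(tb_per_day: float, days: int = DAYS_PER_YEAR) -> float:
    """Return the yearly data volume in petabytes for a constant daily rate."""
    return tb_per_day * days / TB_PER_PB


# 30 TB/day * 365 days = 10,950 TB, or 10.95 PB
print(f"{annual_volume_pb(TB_PER_DAY):.2f} PB per year")  # prints "10.95 PB per year"
```

Under these assumptions, a single site at this rate adds roughly 11 PB a year before any growth in the daily volume is taken into account.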

As the HPC community continues to generate massive amounts of file data, drawing insights from that data, making it useful, and protecting it become a considerable effort with major implications. A better way to manage storage and data has multiple benefits, such as insights drawn from disparate data sources, increased researcher productivity, reduced costs, and lower staff maintenance and administration overhead. To operate at the petascale that HPC demands, the foundations of data management must scale with the data.

Pharma HPC Data Management Case Study

A pharma company was unable to meet backup service level agreements (SLAs), and analysis was delayed due to the performance of their HPC storage infrastructure. Admins had no way of determining how old data was or which directories were the fullest. These issues were impeding research progress. This organization needed a plan to understand their data and manage their storage.

With a data management plan in place, admins took the manual intervention out of backup and increased storage protection, capacity, performance, and usability. This company replaced the old primary storage with lower-capacity, higher-performance storage, and implemented a lower-performance tier for backup and archive.

Data visibility was implemented in the form of an application that rapidly scanned the storage to determine directory size and age. Free-text global search of archived and backed-up files enabled restore functionality to meet SLAs. Workflow integration with the application API allowed efficient transfer of datasets from archive storage to primary storage for reanalysis. Through better data management, research was accelerated.
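The kind of scan described above can be sketched in a few lines of Python. This is not the application used in the case study; it is a minimal, single-threaded illustration that assumes the storage is mounted as an ordinary file system and that modification time is an acceptable proxy for age. The names `DirStats`, `scan_directories`, and `fullest_first` are hypothetical.

```python
import os
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class DirStats:
    total_bytes: int = 0
    file_count: int = 0
    newest_mtime: float = 0.0  # most recent modification time seen (epoch seconds)


def scan_directories(root: str) -> Dict[str, DirStats]:
    """Walk `root` and aggregate size and age per top-level subdirectory."""
    stats: Dict[str, DirStats] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = rel.split(os.sep)[0]  # files directly under root are grouped as "."
        entry = stats.setdefault(top, DirStats())
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            entry.total_bytes += st.st_size
            entry.file_count += 1
            entry.newest_mtime = max(entry.newest_mtime, st.st_mtime)
    return stats


def fullest_first(
    stats: Dict[str, DirStats], now: Optional[float] = None
) -> List[Tuple[str, int, Optional[float]]]:
    """Return (directory, bytes, days since last modification), largest first."""
    now = time.time() if now is None else now
    rows = []
    for name, s in stats.items():
        idle_days = (now - s.newest_mtime) / 86400 if s.file_count else None
        rows.append((name, s.total_bytes, idle_days))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `fullest_first(scan_directories("/path/to/share"))` would list the largest directories first along with how long each has gone unmodified. A serial walk like this would likely be far too slow at petascale, which is why scan performance matters when evaluating a real tool.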

Designing a Data Management Plan

Taking the first steps to a data management plan can seem daunting, but with a list of requirements, your journey can begin. At a high level, a data management plan should include:

Core architecture able to scale to petabytes and even exabytes without loss in performance;
Visibility into files across all systems, accessible through a single interface;
Global search with metadata tagging and extraction;
Archiving functionality to free up space on primary tiers;
Backup for data protection;
Policies for backup, restore, and retention defined and conveyed to users.


As you evaluate solutions to meet these requirements, several key features need to be available in the application:

Vendor agnostic with access to all storage locations through a single interface;
Performance that scales with the data volumes and file count;
Intelligent management of storage to optimize for cost and performance. This includes backup and archive, to any cloud tier or to local storage;
Data analytics through an API or built-in applications, and automatic extraction and assignment of metadata from files;
Quick setup time with low resource requirements and simple configuration.

Currently, no such turnkey solution exists. Many organizations, when faced with the challenges of a data management plan, do not have the resources to take even the initial steps. More often than not, none of the challenges are solved, and other tasks take priority over understanding, using, and protecting data. The option of “doing nothing” is chosen over taking the first steps towards data management. Many organizations, for example, do not have a way to determine the age of files across their NAS footprint. Without this relatively basic level of data visibility, they cannot identify which files to archive to a lower-cost tier, and so cannot free space on high-performance storage.
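As a rough illustration of that basic visibility, the sketch below lists files that have not been modified for a given number of days and totals the space that archiving them would free. It assumes the NAS share is mounted locally and that modification time stands in for age; the function names and the example threshold are hypothetical, and the code only reports candidates without moving anything.

```python
import os
import time
from typing import Iterator, Optional, Tuple

SECONDS_PER_DAY = 86400


def find_archive_candidates(
    root: str, min_age_days: float, now: Optional[float] = None
) -> Iterator[Tuple[str, int]]:
    """Yield (path, size in bytes) for files not modified in `min_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # skip files that disappear or cannot be read
            if st.st_mtime < cutoff:
                yield path, st.st_size


def reclaimable_bytes(
    root: str, min_age_days: float, now: Optional[float] = None
) -> int:
    """Total size of all archive candidates under `root`."""
    return sum(size for _path, size in find_archive_candidates(root, min_age_days, now))
```

For example, `reclaimable_bytes("/path/to/share", min_age_days=365)` would estimate how much primary storage is occupied by files untouched for a year. Whether one year is the right threshold is a policy decision for each organization.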

However, taking the first steps to data management by implementing a scalable solution will allow you to scale your plan with your data. With this plan in place, you can:

Understand, use, and manage your data more effectively through visibility, search, and metadata
Protect your data and manage infrastructure through backup and archive

Instead of avoiding the considerable challenges associated with data-driven HPC, begin your journey to effective data management. Building the foundation on a powerful, scalable solution will let you meet performance needs now and in the future.

About the author: Andy Ferris is a product manager at Igneous with experience developing new software products and managing cross-functional teams. Andy has a Bachelor of Science degree in Mechanical Engineering (ME) from the University of Washington, and a Master of Business Administration (M.B.A.) focused in Entrepreneurship from The University of Texas at Austin. At Igneous, he assists customers with understanding and effectively managing their data at petascale.


