April 28, 2014

Next-Generation DNA Sequencing Performance at Scale

When organizations sequence and analyze DNA, a common workflow includes steps to move large sets of intermediate or processed data between sequencers and systems dedicated to assembly and analysis. Less discussed, but equally important, is the requirement for these data sets to be managed throughout their lifecycle, from assembly of raw data to archival of analyzed data. At the center of a next-generation sequencing (NGS) bioinformatics solution is the need for workflow-driven storage tailored to the individual research or clinical lab.

The Cray NGS solution comprises three core elements — compute, storage and analysis — each designed for the highest level of performance.

Compute at the NGS core

Compute lies at the core of sequencing and assembly workflows. In these workflows, raw data output from next-generation sequencers is passed to systems that perform the computationally intensive assembly of genomes (the work of turning fragmented digital representations of genomes into whole genome maps).
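To make the idea of assembly concrete, the sketch below merges short reads by their longest suffix-to-prefix overlap. It is a toy illustration only: the function names and the `min_len` threshold are assumptions made for this example, and production assemblers work differently (Velvet, for instance, builds a de Bruijn graph rather than comparing reads pairwise).

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of a that is a prefix of b."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Toy greedy strategy: quadratic per round, ignores sequencing errors,
    repeats, and reads contained inside other reads.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical reads, invented for illustration.
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))
```

Even this toy version compares every sequence against every other in each round, which hints at why assembly of real read sets is demanding in both compute and memory.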

Cray® XC30™ supercomputers and Cray® CS300™ cluster supercomputers are suited to the requirements of research institutions and clinics performing sequencing and assembly on a daily basis. What sets Cray’s technologies apart is that they can meet a range of requirements: the systems are designed to handle everything from special-purpose compute needs to diverse sets of applications.

In facilities that support multiple applications, the Cray XC30 supercomputer provides a robust and scalable architecture for bioinformatics. For institutions looking for a dedicated assembly system, the Cray CS300 cluster supercomputer series, and in particular the CS300 Large Memory System incorporating vSMP Foundation™ from ScaleMP™, supports assembler applications, such as Velvet, that require large amounts of shared memory.

Regardless of the choice, Cray systems can be used to run community-standard applications such as Galaxy to manage NGS workflows and provide result visualization and analysis.

The need for workflow-driven storage

An NGS workflow involves repeated manipulation of raw data: DNA fragment files output from sequencers, which can measure upwards of 300 gigabytes, are assembled into whole genome maps and used for research and clinical purposes.

In environments using the XC30 supercomputer, the Cray® Sonexion® scale-out Lustre® storage system simplifies deployment and management with storage that delivers exact levels of performance in an integrated and preconfigured package.

Organizations deploying the CS300 system for NGS can choose from a range of storage solutions from Cray, DDN, NetApp and other manufacturers.

Often forgotten is the fundamental problem of cost-effectively managing and maintaining an ever-increasing collection of data sets associated with NGS workflows. The explosion of data from NGS — from raw sequence data to final results data — has unleashed an unprecedented data management responsibility. For many organizations, data growth is outpacing the ability to manage and archive it. This situation has created a need for an archival system to transparently migrate data to different tiers of storage — from high-performance scratch parallel file systems to capacity-optimized disk and tape archives.
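As a rough illustration of what migrating data across tiers can mean in practice, the sketch below assigns data sets to a tier based on how recently they were accessed. It is not the Cray TAS or Versity interface: the tier names, the `DataSet` fields and the 30-day and 180-day thresholds are all assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """Hypothetical record describing one NGS data set."""
    name: str
    size_gb: float
    days_since_access: int


def choose_tier(ds: DataSet, disk_after_days: int = 30,
                tape_after_days: int = 180) -> str:
    """Pick a storage tier from the data set's age (assumed thresholds)."""
    if ds.days_since_access >= tape_after_days:
        return "tape_archive"
    if ds.days_since_access >= disk_after_days:
        return "capacity_disk"
    return "scratch"


def plan_migration(datasets: list[DataSet]) -> dict[str, list[str]]:
    """Group data set names by the tier the policy assigns them to."""
    plan: dict[str, list[str]] = {
        "scratch": [], "capacity_disk": [], "tape_archive": []
    }
    for ds in datasets:
        plan[choose_tier(ds)].append(ds.name)
    return plan


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        DataSet("run_raw_reads", 300.0, 2),
        DataSet("assembled_genome", 3.5, 45),
        DataSet("final_results", 0.8, 400),
    ]
    print(plan_migration(sample))
```

A real archive would weigh more than access age (file size, project status, retention rules), but the principle is the same: a policy decides placement so that users do not have to move data by hand.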

Cray Tiered Adaptive Storage (TAS), powered by Versity, is the only complete and open archiving solution built for enterprise-class Linux® environments, including Cray XC30 and CS300 systems. TAS is designed to meet NGS customers’ massive scalability needs. Cray preconfigures, integrates and tests all hardware and software to provide a ready-to-deploy system.

Cray solutions for analysis

Data that has been sequenced, assembled and captured is of little value unless it can be analyzed. Together, Cray and YarcData provide a comprehensive set of platforms to perform visualization and analysis. Widely used genome analytics applications such as Galaxy and BLAST (NCBI and AbokiaBLAST) are available for the XC30 and CS300 systems.

As researchers and healthcare professionals expand the practice of precision medicine, rapid analysis of genomic information will increasingly play a role in patient diagnosis and treatment. Big data graph analytics can help healthcare professionals, analysts and scientists take full advantage of their data. It enables the capture and exploration of relationships among vast data sources, something impossible to achieve with a search approach, and turns data’s latent value into realized value.

YarcData’s Urika® graph analytics appliance is purpose-built for discovery and enables new insights in real time. It addresses the limitations of commodity hardware, scaling to meet increasing volumes of data and quickly updating relationships as new data streams in.

Why Cray?

The ability to turn pioneering hardware and software technologies into renowned supercomputing solutions is the work of decades, and no one has more experience than Cray. It’s why leading users across industries and disciplines repeatedly choose Cray. From technical enterprise systems to petaflop-sized solutions, Cray systems enable tremendous scientific achievement by increasing productivity, reducing risk and decreasing time to solution.

http://www.cray.com/bio-itworld/

Datanami