This week the Texas Advanced Computing Center (TACC) at the University of Texas at Austin announced a $10 million commitment from the O’Donnell Foundation to enhance its data-intensive science capabilities.
TACC says that this funding will be used for new data infrastructure that will allow the center to broaden the scope of big data problems in science. Researchers in diverse fields, including bioinformatics, neuroscience, structural biology and astrophysics, among others, will be able to advance beyond current constraints, hopefully yielding new discoveries along the way.
According to TACC officials, the data infrastructure plans include:
- a high-performance, petascale data storage system accessible to all of TACC's computing and visualization systems, and easily expandable to hundreds of petabytes in the coming years;
- a computational system with embedded high-speed storage that is optimized for data-intensive computing, including massive data processing and analysis; and
- new servers and storage to host innovative Web-based and cloud computing services, including science portals and gateways that enable researchers around the world to use the university's research applications.
To learn more, we talked with the center’s director, Dr. Jay Boisseau, about the funding and what it means for the future of data-intensive science at TACC. He shed light on some of the specifics of the upcoming technology purchases and exploration areas, and offered insight into how the center is already working with big data problems in science on existing clusters.
Datanami: What elements is TACC seeking in order to meet the need for new high-end data-intensive capabilities? Does this mean a new cluster entirely--and if so, can you say who or what type of system you're considering?
Boisseau: We will deploy a new high-speed parallel filesystem (estimated start size 20 PB) that is accessible from all TACC resources, and that can be scaled up to 100+ petabytes.
We will also deploy a cluster optimized for MapReduce/Hadoop-style calculations--lots of node-level disk for persistent storage of data collections for which this programming model is optimal.
We will also provide a new high-throughput computing capability and larger shared memory capabilities. We already provide these capabilities on Ranger and Lonestar, but we will provide them at greater scale/emphasis in the new systems. They may be part of the same cluster that provides the MR/Hadoop capability.
We will also provide a new hosting environment for science portals and gateways that host front-end applications to workflows that leverage the new data resources (as well as our current and future HPC and vis resources).
We will also evaluate opportunities for using SSD and other technologies for new data applications.
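For readers unfamiliar with the MapReduce programming model Boisseau refers to, the following is a minimal, single-process Python sketch of the idea. It is purely illustrative and is not TACC's software or Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample records are assumptions made up for this example. On a real Hadoop-style cluster, the map and reduce steps would run in parallel on many nodes, each working on data held on its own local disk.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as a framework
    would do between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce in sequence in a single process."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input: two short text records.
counts = map_reduce(["big data big science", "data intensive science"])
print(counts)
```

Because each record is mapped independently and each key is reduced independently, the work can be split across many machines, which is why this style of computation benefits from large amounts of node-level disk.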
Datanami: How is TACC benchmarking or evaluating data-intensive computing solutions? Is this different from making HPC/supercomputing decisions, in that speed might not be the defining factor? And how do rankings like the Graph500 or other HPC/data-intensive benchmarks fit into your decision-making process?
Boisseau: It is different, and we're still working through some of this. We are going to host a workshop on May 22-24 (announcement coming next week) at which we expect to discuss these and other relevant questions. We think there is a great need even for clearer definitions of terms and requirements for 'data intensive computing,' 'data driven science,' etc., and understanding the science and technology requirements of classes of applications will help us develop a methodology for carefully designing the configurations for new data intensive computing systems.
Datanami: TACC is already home to high-end HPC systems; where will a new data-intensive system fit into your existing technology “portfolio” of supers, and what applications will be specific to any new machine that might not have been acceptable to run on other TACC clusters?
Boisseau: We're home to high-end HPC (Ranger, Lonestar) *and* scientific visualization systems (Longhorn, Stallion), and we just upgraded our existing data management systems: our data collections hosting system (Corral) and our archival system (Ranch).
We have added new 'data intensive computing' capabilities to some systems: software to bundle jobs and enable HTC on Ranger and Lonestar; large shared memory nodes (1TB) on Lonestar; and a Hadoop-style subsystem on Longhorn. The major new systems we will deploy with this new funding are: a separate cluster designed and optimized for more data driven science applications by offering larger MR/Hadoop style capability, larger shared memory, better HTC capabilities, etc.; and also a large high-speed filesystem that all HPC, visualization, and data clusters can access.
Thus, our HPC, visualization, and data intensive computing cluster systems will all have access to a high-speed parallel file system, a data collections management system, and a data archival system. A gateway hosting environment will host portals and other applications that can leverage all of the back-end systems.
In addition to TACC’s upcoming capabilities for departmental projects, the center says the new resources will also augment TACC's ability to support research at related university institutions, including biomedical research at UT Southwestern Medical Center. As the statement noted, “Novel data-driven projects such as consumer energy usage behaviors being studied at Austin's Pecan Street Inc. will also benefit, as will major national projects in which the university is a key partner such as the iPlant project, a $50 million National Science Foundation-funded project to help with plant research, including improving food yields and producing more effective biofuels.”
The O'Donnell Foundation has already contributed $6 million of the commitment to The University of Texas at Austin and will provide $2 million more in each of the next two years. The university will also provide an additional $2 million over five years to hire new technology professionals at TACC, who will support and accelerate new research in ICES and other university programs that leverage these data resources.