The Real Challenges of Exascale
Last November at SC11, DataDirect Networks (DDN) issued an open letter to the HPC community, highlighting the issues and challenges we face as a community at the dawn of Exascale. With ISC 2012 happening next month in Hamburg, it is time to reflect on what those challenges are and on the progress being made to address them. Consider this short article a primer on what to look for at ISC’12 in the presentations and developments being highlighted, and a way to check for yourself whether progress is being made in addressing the real challenges of Exascale.
At heart is the urgency of finding solutions to the unprecedented challenges that Exascale computing presents to the entire HPC community. Among them is the need to increase computational performance by three orders of magnitude while providing a balanced I/O environment capable of supporting peak burst rates of over 100 terabytes per second with multiple exabytes of total capacity. A significantly greater challenge will be supporting the dramatic increase in total concurrency and providing a highly resilient I/O substrate as component counts increase and voltage thresholds decrease. Addressing these challenges will require a holistic approach spanning applications, I/O middleware, storage system software, and the entire compute infrastructure, from node architecture to network and persistent storage technologies.
DDN recognizes that just as scientific discovery happens through partnership and collaboration, so should the process of building systems that are three orders of magnitude more capable than today’s Petascale I/O environments. This advancement must be the product of a cross-community partnership that includes not only CPU, memory, network, storage media, and infrastructure manufacturers but, perhaps more importantly, the scientific computing community whose applications must scale and the users of these Exascale systems – on whose success we are all focused. Because the Exascale challenge has such wide-ranging impact, we believe it is important to engage the wider HPC community in this collaborative discussion. Some areas of the HPC infrastructure may see substantial change, and we will need to coordinate these solutions if we are to build usable Exascale systems in this decade. DDN is firmly committed to an ecosystem approach to developing solutions to these challenges, allowing the community to leverage best-of-breed technologies designed and developed through partnership and collaboration.
While Exascale systems pose significant challenges for the community to address by the end of the decade, many of the challenges we anticipate in the Exascale era are confronting us even today. Data is growing faster than our ability to manage it and faster than our scientists’ ability to extract useful knowledge from it. This challenge can overwhelm us if we do not step outside the traditional approaches to storage and data management. The data explosion spans many HPC domains, including:
- Climate Science – Ultra-high resolution climate models and model inter-comparison
- Material Science – Large scale ensemble analysis and materials by design
- Genomics – Driven by comparative analysis and personalized medicine
- Uncertainty Quantification – Statistical analysis applied to ensembles of simulations
- Oil & Gas – High density, wide azimuth surveys iteratively measuring reservoirs
- Financial services – Risk quantification for a growing number of markets and securities
- CFD – Extending simulations from steady sub-components to full unsteady vehicle analysis
Today’s HPC I/O environment has reached the limits of its ability to scale. Attempts to build on the legacy I/O architecture will become increasingly expensive and fragile, because current file system and storage technologies were originally designed for use at much smaller scales. Below are several areas where DDN believes that new approaches can be brought to bear at the Petascale and beyond in order to address fundamental challenges in achieving scalability and performance.
In-Store Compute – For many workloads, the overall costs associated with transferring massive amounts of data between storage and compute outweigh the costs of the compute itself; this is ushering in the era of In-Storage Processing™. HPC storage architectures must evolve to leverage what has been learned from the specialized systems purpose-built to handle today’s Big Data problems. For many workloads, shipping the function to where the data is stored can result in significant speedups, even for moderately compute-intensive functions. For computationally intensive HPC environments, it may be more important to use this capability to manage metadata, to run pre- and post-processing functions, and to achieve tightly integrated ILM services such as data retention, archiving and retirement.
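The data-movement argument can be illustrated with a minimal Python sketch. It is not a description of any DDN product: the `StorageNode` class, its `read_all` and `run` methods, and the 8-byte sample size are hypothetical, chosen only to compare how many bytes cross the network when the data is pulled to the compute node versus when a function is shipped to the data.

```python
from dataclasses import dataclass

BYTES_PER_SAMPLE = 8  # assumption: each sample is a 64-bit float


@dataclass
class StorageNode:
    """Hypothetical storage node that holds samples and can run functions locally."""
    samples: list

    def read_all(self):
        # Traditional path: every sample crosses the network to the compute node.
        return list(self.samples), len(self.samples) * BYTES_PER_SAMPLE

    def run(self, func):
        # Function shipping: func executes next to the data; only one result moves.
        return func(self.samples), BYTES_PER_SAMPLE


def mean(values):
    return sum(values) / len(values)


node = StorageNode(samples=[float(i) for i in range(100_000)])

# Pull everything, then reduce on the compute side.
data, moved_traditional = node.read_all()
result_traditional = mean(data)

# Ship the reduction to the storage side.
result_shipped, moved_shipped = node.run(mean)

print(f"traditional: {moved_traditional} bytes moved")
print(f"shipped:     {moved_shipped} bytes moved")
```

Both paths compute the same mean; the difference is only in the volume of data transferred, which is the cost the paragraph above argues can dominate.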
Global Object Stores – Traditional, tree-based POSIX file systems are another significant bottleneck to scalability. Moving towards a flatter namespace for applications, where individual processes transact with single objects, eliminates many of these scalability challenges. For example, “Big Data” challenges have driven Web systems architects to build much more scalable key-value data stores to resolve the traditional limitations that prevent doing business in these highly scalable, fully distributed environments.
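As a rough sketch of what a flat namespace looks like from the application side, the Python below models a key-value object store in which each process writes its own object and the key alone determines where the object lives. The `FlatObjectStore` class, the shard count, and the key format are illustrative assumptions, not an actual object-store API.

```python
import hashlib


class FlatObjectStore:
    """Hypothetical flat key-value store; keys are hashed directly to a shard."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # No directory tree to traverse or lock: placement depends only on the key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)][key]


store = FlatObjectStore(num_shards=8)

# Each of 1024 simulated processes transacts with exactly one object.
for rank in range(1024):
    store.put(f"job42/checkpoint/rank{rank:05d}", f"state-of-rank-{rank}".encode())

shard_sizes = [len(shard) for shard in store.shards]
print(shard_sizes)
```

The slashes in the key are just characters; there is no shared parent directory whose metadata every writer must update, which is the contention point a tree-based namespace introduces.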
Knowledge Management – The data produced by the explosion in HPC must not only be efficiently stored and made available for access; its content must also be effectively managed and curated. As data is shared and updated over time, the provenance of the data becomes ever more critical to the reliability of the experiments. Although there have been some successful efforts toward data model definition and taxonomies in some fields, most existing content management systems – and the knowledge derived from that content – have been ad hoc and added late in the cycle. Building knowledge management into the foundation of the information architecture enables team collaboration and lets scientists share valuable insight that would otherwise be overwhelmed by the sheer volume of raw data.
Next Generation Solid State Storage – There are numerous opportunities to improve the storage hierarchy using the next generation of solid-state devices. Fully utilized, these will extend the capabilities glimpsed with flash-based SSDs to lower cost, improve reliability and dramatically improve performance. The next generation of solid-state memory technologies promises performance similar to DRAM, superior non-volatile storage characteristics and architectures that will allow them to be very cost-effective. This has the potential to dramatically change the storage paradigm from a block-centric I/O model to a byte-addressable model.
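The difference between the two access models can be sketched in a few lines of Python. This is a toy model under stated assumptions: a `bytearray` stands in for the device, the block size is assumed to be 4 KiB, and the two `update_*` functions are hypothetical helpers that only count bytes transferred for a small in-place update.

```python
BLOCK_SIZE = 4096  # assumption: a 4 KiB block


def update_block_device(device, offset, payload):
    """Block model: read-modify-write every block the update touches.

    Returns the number of bytes transferred to and from the device.
    """
    end = offset + len(payload)
    first = offset // BLOCK_SIZE
    last = (end - 1) // BLOCK_SIZE
    transferred = 0
    for block in range(first, last + 1):
        start = block * BLOCK_SIZE
        buf = bytearray(device[start:start + BLOCK_SIZE])  # read the whole block
        lo, hi = max(offset, start), min(end, start + BLOCK_SIZE)
        buf[lo - start:hi - start] = payload[lo - offset:hi - offset]
        device[start:start + BLOCK_SIZE] = buf  # write the whole block back
        transferred += 2 * BLOCK_SIZE
    return transferred


def update_byte_addressable(memory, offset, payload):
    """Byte-addressable model: store only the bytes that changed."""
    memory[offset:offset + len(payload)] = payload
    return len(payload)


device = bytearray(16 * BLOCK_SIZE)
memory = bytearray(16 * BLOCK_SIZE)

block_bytes = update_block_device(device, 100, b"8 bytes!")
byte_bytes = update_byte_addressable(memory, 100, b"8 bytes!")
print(block_bytes, byte_bytes)
```

Both calls leave identical contents behind; in this model the block path moves a full block in each direction to change eight bytes, while the byte-addressable path moves only the eight bytes.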
Behavioral Systems Analytics – When looking at the entire Exascale complex, it is clear that system management will need to take a broader and more holistic view of managing the entire application process. To enhance systems management, software and hardware sensors could be applied at every critical juncture or level of the HPC architecture to provide state or metering data. By utilizing this sensor- or agent-based telemetry, we should be able to capture evidence of the system’s behavior, analyze and visualize it, and derive proactive operational optimizations to the overall HPC process.
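One very simple form such analysis could take is sketched below in Python. The component names, the latency values, and the "more than twice the median" rule are all made up for illustration; a real deployment would draw on far richer telemetry and models.

```python
from statistics import median


def flag_slow_components(latency_ms, factor=2.0):
    """Return the components whose latency exceeds factor times the median.

    The median is used as the baseline because a single outlier barely moves it.
    """
    baseline = median(latency_ms.values())
    return sorted(name for name, value in latency_ms.items()
                  if value > factor * baseline)


# Hypothetical per-server write latencies reported by sensors, in milliseconds.
latency_ms = {
    "oss0": 5.1, "oss1": 4.9, "oss2": 5.0, "oss3": 5.2,
    "oss4": 4.8, "oss5": 5.0, "oss6": 5.1, "oss7": 25.0,
}

slow = flag_slow_components(latency_ms)
print(slow)
```

Flagging a lagging component from routine telemetry, before an application stalls on it, is the kind of proactive optimization the paragraph above has in mind.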
DDN has worked with some of the leading HPC users in the world, listening to their needs and building HPC systems to help them deliver better results, faster. It is clear that the move to Exascale will require disruptive innovation in the HPC architecture, and the I/O subsystem in particular, to reach this next level of capability.
As such, DDN offers a series of questions to the HPC community:
- What types of processing phases (pre, core, post) would benefit from In-Storage Processing as an alternative to moving data from storage to cluster for every type of function?
- Can applications evolve away from POSIX to a native object interface to enhance scalability and system performance?
- Should knowledge management and provenance tracking become a component of the system’s foundation, managed collaboratively with the application?
- What new capabilities do future non-volatile memory technologies and the change in the storage I/O paradigm from block access to byte-addressable memory bring to HPC?
- Does the community believe that HPC systems efficiency can be improved by leveraging telemetry and behavioral analytics?
DDN looks forward to an open discussion (at ISC’12 and beyond) about the key issues and opportunities confronting the HPC community as it moves toward Exascale.