November 15, 2011

The Big Data Opportunity for HPC

Intersect360 Research

The world is changing, this time at the speed of data…

With the explosion of new devices, sensors, GIS-based services, social networks, and cutting-edge tools that deliver real-time feeds over the web and into datacenters large and small, global data generation continues unchecked. Rather than being overwhelmed by this near-constant flood of data, many enterprises are looking ahead to the wealth of possibilities that such a diverse, deep, often real-time view into business-critical trends, behaviors, and patterns can yield.

Across the range of big data applications, in enterprise and research analytics, in real-time analytics and complex event processing, and in data mining and visualization, organizations are seeking to make their data actionable. That is, they seek to turn superior information into superior decisions. And although many of these organizations may not think of themselves as traditional “HPC” users, there are great opportunities for HPC technology vendors to apply their technologies to these new categories of problems.

Organizations that might once have considered high performance computing the exclusive domain of governments and research centers are now seeing the competitive edge that comes from HPC hardware and services, as well as the analytics advantages. In some ways, rather than vendors having to hunt for the “missing middle,” the increased adoption of big data analytics is leading that large segment of potential HPC users directly to the performance, latency, and data-intensive computing experts of the supercomputing industry.

Categories of Big Data Applications

It would be far too simple to point to the term “big data” and claim that it refers only to size. In fact, size alone is only a small part of the problem, especially for HPC vendors who are used to working with customers that handle large datasets. Viewed from an application standpoint, data-intensive computing can be defined as comprising applications in which data movement is the primary bottleneck, or in which completing the computation within the short lifespan of the data is the central challenge. Whereas compute-intensive applications spend most of their time executing computation, often on comparatively modest datasets, data-intensive workloads spend much of their time on I/O and data movement rather than on computation.
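A rough back-of-the-envelope sketch can make the distinction concrete. In the Python snippet below, every bandwidth and throughput figure is an illustrative assumption, not a measurement; the point is simply that when the time to move the data dwarfs the time to compute on it, the workload is data-intensive and faster processors alone will not help.

```python
# Illustrative only: all figures below are assumed round numbers, not measurements.
DATASET_BYTES = 1e12      # 1 TB of input data
IO_BANDWIDTH = 1e9        # 1 GB/s sustained read bandwidth (assumed)
FLOPS_PER_BYTE = 10       # arithmetic intensity of the analysis (assumed)
COMPUTE_RATE = 1e12       # 1 Tflop/s delivered by the node (assumed)

io_time = DATASET_BYTES / IO_BANDWIDTH                         # ~1,000 s to move the data
compute_time = DATASET_BYTES * FLOPS_PER_BYTE / COMPUTE_RATE   # ~10 s to process it

print(f"I/O time: {io_time:,.0f} s, compute time: {compute_time:,.0f} s")
# When io_time dominates compute_time, the bottleneck is data movement,
# which is the defining characteristic of a data-intensive workload.
```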

An additional element to this definition is needed to capture the importance of the applications within the big data ecosystem. These are sets of applications that can be classified under a number of functional banners, including complex event processing, advanced enterprise analytics (i.e., BI platforms that leverage real-time processing of both structured and unstructured data), complex visualization, data mining, and related software tools for addressing stream processing.

The common thread these applications share is their reliance on data in variable formats, often gathered from sources outside the enterprise (sensor readings, social media data, etc.). The tie that binds them to HPC is a set of basic system requirements that HPC vendors are used to addressing with traditional systems, namely performance, efficiency, and low latency, combined with a new set of frameworks for managing large volumes of information (Hadoop and MapReduce, for instance).

The Merging of HPC and Big Data

Although not all big data is HPC, vendors with high performance technologies may find significant growth opportunities in big data applications. Intersect360 Research is studying the link between HPC and big data, and the two share common features at both the macro level and the level of specific components. The following sections broadly highlight the key elements where the two overlap and where they differ.

Computation: Systems, Processors and Accelerators

Increasingly, big data applications, including complex event processing, real-time handling of transactional data, and immediate analysis of incoming flows from sensors, social networks, and other sources, require not only highly efficient ways of handling variable data streams at the software level, but also a high performance system that provides the needed power as efficiently as possible. In many cases, computation needs to be done within a very short life cycle of data. That is, if the computation takes too long, the analysis of the data becomes worthless.
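As a minimal, hypothetical sketch of what computing within the life cycle of the data can look like, the toy operator below keeps only events young enough to still matter and answers with an aggregate over that window. It is illustrative Python, not any particular vendor's event-processing product, and the five-second window is an arbitrary assumption.

```python
from collections import deque
import time

WINDOW_SECONDS = 5.0   # assumed useful lifespan of an event; older data is discarded

class SlidingWindowAverage:
    """Toy stream operator: a rolling average over only the still-relevant events."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = deque()                  # (timestamp, value) pairs, oldest first

    def ingest(self, value, now=None):
        now = time.time() if now is None else now
        self.events.append((now, value))
        # Expire events whose useful life has passed before computing anything.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        return sum(v for _, v in self.events) / len(self.events)

# Usage: feed readings as they arrive; each call returns an up-to-the-moment
# answer computed only over data that is still worth analyzing.
op = SlidingWindowAverage()
for t, reading in enumerate([10.0, 12.5, 11.0, 30.0]):
    print(op.ingest(reading, now=float(t)))
```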

Storage

Innovations in storage are at the heart of the next wave of big data advances. Again, this is an area where HPC storage providers have a chance to shine. New offerings that keep immediately needed data as close to the point of processing as possible, tucking it away only after use, are important for many big data applications. Accordingly, advances in storage memory will be critical turning points for the storage industry as datasets grow.
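A minimal sketch of that keep-it-close-then-tuck-it-away pattern, assuming nothing more specific than a fast tier of limited capacity in front of a slower archive tier and a least-recently-used demotion policy, might look like the following. It is a toy illustration, not a model of any particular vendor's storage product.

```python
from collections import OrderedDict

class HotTierCache:
    """Toy model of keeping immediately needed data near the processor and
    demoting it to a slower tier once capacity is exceeded (LRU policy)."""

    def __init__(self, capacity_items, cold_store):
        self.capacity = capacity_items
        self.hot = OrderedDict()       # fast tier: e.g., memory or flash
        self.cold = cold_store         # slow tier: e.g., disk or tape archive

    def read(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)          # recently used data stays hot
            return self.hot[key]
        value = self.cold[key]                 # miss: fetch from the slow tier
        self.write(key, value)
        return value

    def write(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.capacity:   # demote least recently used data
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value

# Usage: a plain dict stands in for the archive tier.
cache = HotTierCache(capacity_items=2, cold_store={"a": 1, "b": 2, "c": 3})
print(cache.read("a"), cache.read("b"), cache.read("c"), cache.read("a"))
```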

Other trends will continue to evolve to meet growing demand, not just for point-of-processing storage but for archiving. Tape, deduplication, and other archival technologies are set to take center stage in the coming years, representing a significant overlap between HPC and the broader enterprise market.

Interconnects

For many time-sensitive and data-heavy applications, the network is king. Without very low latency, many of the innovative, mission-critical applications that operate in real time on massive data streams would be significantly delayed.

While the role of the network in big data is not difficult to delineate, questions remain about how HPC vendors in this arena can carry their products into the general market in a way that is cost effective and offers a predictable ROI. The segment is a critical one, and the interconnect market is in for a wild ride as vendors differentiate their offerings and capabilities while data grows and applications continue to push the latency envelope.

Cloud

As new users of high-performance technologies scale into big data applications, many will utilize public or private clouds in place of dedicated in-house HPC architectures. However, to achieve this, they must overcome significant potential barriers in both bandwidth and security. Clouds will be part of the solution for many big data applications, but only when deployed selectively, with an eye toward the goal of efficiently generating actionable insights from data.

The Stack

HPC software vendors are in a unique position as new analytics packages, open source and proprietary alike, spring into being at an increasing clip. From the management layer up, many changes are coming to the stack. ISVs are racing to keep up not only with one another but also with popular emerging frameworks, which are being extended and are sometimes replacing specific functions outright. For instance, Hadoop provides data management and processing capabilities that might otherwise have taken a phalanx of in-house developers to build.

Opportunities

With the swift expansion of the stable of big data applications on top of increasingly mature, stable big data platforms and frameworks, HPC vendors could have a golden opportunity to win customers on the strength of their commitment to ultra-scale performance and attention to speed.

Tentative customers investigating the potential of big data are seeking solutions that will help them extract meaning quickly, reliably, and without massive overhead. This requires a fine-tuned approach that many HPC vendors are well positioned to offer, given their historical emphasis on performance in the face of challenging application demands. However, a differentiated technology approach is required for high performance computing to enter the purview of this broad base of potential customers.

For example, the emphasis in system, processor, and accelerator technologies has in recent years been on increasing raw computation (usually flops), especially as the community continues the push toward exascale capabilities. While computational horsepower is a consideration within some big data applications (especially real-time, complex event processing algorithms), the size of the data requires extreme focus on the power envelope. Power is an important consideration in HPC as well, but for smaller datacenters looking to minimize cost per watt, there needs to be recognition at the server level of balanced performance and power usage.

High performance storage also needs to maximize its value in terms of cost, balancing the need for ample capacity to rapidly process the data an application needs right now against intelligently moving what is no longer needed out to non-primary storage. These challenges are already being addressed by a number of vendors, some of which are not strong players in HPC. Again, a reshaping of strategy is required here.

Whatever alterations in approach are required, the “big data” trend that has been capturing the attention of enterprises is rapidly evolving, and businesses are realizing that their key to competitiveness lies in their ability to leverage the massive amounts of data being generated from an array of sources.

For vendors catering to both HPC and big data, this trend is reinforcing the need for investment in high-end systems with high performance storage, networks and applications. These systems must provide the capability to address the requirements of new breeds of applications that emphasize rapid processing of unstructured and structured data that is being fired into corporate datacenters and research facilities at unprecedented rates.

The applications have changed, the sources of data they utilize have evolved, and the systems, storage, and network environments have also been altered by the needs of big data customers. It stands to reason that the overall approach should be tailored as well. HPC vendors have the opportunity to appeal to a new subset of enterprise customers by recognizing their challenges and the solutions they require.

The HPC vendor community might already be uniquely positioned to address the scalability, processing power, storage capacity, and low-latency needs of many of these broad application types, but there are potential technological roadblocks that must be addressed in advance.

Challenges

Enterprise and research organizations are faced with the incredible challenges of gathering, storing, managing, analyzing and acting on the flood of potentially valuable information. In fact, nearly every element of the big data challenges ahead requires a fresh approach to meeting additional demands for speed, capacity, efficiency and insight.

For example, tools that untangle the complexity of massive, unstructured and structured datasets by splitting jobs into smaller components and simplifying parallelization are on the rise. The most popular such framework at the moment is Hadoop, the open source project that forms the back-end core of many data-intensive operations in large-scale enterprise and research settings, and that is attracting a growing number of converts to both the community version and the supported, professional distributions.
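The split-the-job-into-smaller-components idea can be illustrated with a toy, in-process sketch of the map/shuffle/reduce pattern that Hadoop automates across a cluster. The word-count example below is hypothetical and deliberately simplified; it runs on one machine and is meant only to show why the model parallelizes well, since each record is mapped independently and each key group is reduced independently.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: turn one record into (key, value) pairs, independently of all other records."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(key, values):
    """Reduce phase: combine all values that the shuffle grouped under one key."""
    return key, sum(values)

def map_reduce(records):
    # Map: each record could be processed on any worker, in any order.
    mapped = chain.from_iterable(mapper(r) for r in records)
    # Shuffle: group intermediate pairs by key (Hadoop does this across the cluster).
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: aggregate each group independently.
    return dict(reducer(k, v) for k, v in groups.items())

# Usage: word count over a few "log lines" standing in for a large unstructured dataset.
lines = ["error disk full", "warning disk slow", "error network down"]
print(map_reduce(lines))   # {'error': 2, 'disk': 2, 'full': 1, ...}
```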

In the HPC vendor ecosystem, solutions have already been developed that address some of the more practical needs of big data customers. High performance storage options exist that can handle terabyte datasets with relative ease, and the computation angle has largely been addressed through highly parallel applications that leverage multicore and accelerator innovations. Still, to truly keep pace with the growth of this data (soon to be measured in exabytes), “good enough” may not do. More storage (and more efficient, cost-effective storage), more computational horsepower, and lower latencies are required to meet the demands of a big data age in which “real-time” is increasingly the key phrase.

Conclusion

Rising to the challenges of big data requires the ability to create or adapt to advanced frameworks designed to handle data of variable origin and structure. It also demands the kinds of high-performance systems, networks, and storage that were previously associated with HPC. And it involves new types of applications that leverage data that does not mesh well with traditional databases and existing systems.

While the complexities of creating data-intensive systems are great, few in the IT industry at large are better equipped to handle these new demands than those in HPC. Thanks to a tradition of focusing on the high end in terms of size and capability, HPC vendors have much to gain as more enterprises that might never have considered themselves “HPC users” look to them for big data solutions.
