This week the annual Supercomputing Conference (SC12) will kick off in Salt Lake City, Utah, drawing over ten thousand of the world’s leaders in high performance computing (HPC) research—and an increasing number of enterprise big data execs seeking new lessons from the established world of HPC.
No longer can we think of supercomputing as detached from real-world business use cases; the merging of these two spaces is unmistakable, in part because of the attention big data has heaped onto boosting performance for massive datasets (something the supercomputing world has been doing for several decades).
From the opening keynote tomorrow from celebrity physicist Dr. Michio Kaku to the advanced sessions on everything from exascale to Hadoop, this is sure to be a week packed with news about the most cutting-edge developments in both compute- and data-intensive research.
You might be struck by the number of MapReduce, Hadoop and graph-focused sessions. While many of these are in a high performance computing (HPC) context, it’s not difficult to see the practical applications for many of these developments as they relate to large-scale enterprise big data problem-solving.
Even if you’re not attending, this breakdown of some of the most important sessions will give you an insider’s view into what some of the largest (not to mention most well-funded) universities, national labs, and vendor research and development teams are considering for the future. For those who plan to be present this week in SLC, a calendar file to see the time and day of each event is linked within each item for easier scheduling.
Let’s dive in, folks—here are 20 sessions (not ordered by importance) that hold promise for the enterprise big data buff seeking new lessons from the bleeding edge. I’ll be at the show all week; stop by the booth or grab me to talk about what challenges we should write about more often.
#1 “Dr. Data’s” Technical Overview of Big Data
If any of you are planning on attending, there is one key event that takes place throughout the day on Monday (8:00 – 5:00) that I recommend highly from one of the undisputed leaders in scientific “big data” computing, Dr. Alex Szalay from Johns Hopkins University, whom we interviewed here about this time last year.
With help from Robert Grossman of the University of Chicago and Collin Bennett from the Open Data Group, Szalay will give a full introduction to some of the tools and techniques that can be used for managing and analyzing large datasets.
The team will introduce ways of managing datasets using databases, federated databases (Graywulf architectures), NoSQL databases, and distributed file systems such as Hadoop. They will also introduce parallel programming frameworks such as MapReduce and Hadoop streams, pleasantly parallel computation using collections of virtual machines, and related techniques.
The team will also show different ways to explore and analyze large datasets managed by Hadoop using open source data analysis tools such as R. They will illustrate these technologies and techniques with several case studies, including the management and analysis of the large datasets produced by next-generation sequencing devices, the analysis of astronomy data produced by the Sloan Digital Sky Survey, the analysis of earth science data produced by NASA satellites, and the analysis of netflow data.
#2 Tackling Data Analytics at the Petabyte Scale
Petabyte-sized data archives are no longer uncommon. It is estimated that organizations with high end computing (HEC) infrastructures and data centers are doubling the amount of data that they archive every year. To add further complexity, the infrastructure required to deal with this data is becoming more heterogeneous to keep pace.
Today (Monday) Scott Klasky and Ranga Raju Vatsavai from Oak Ridge National Lab are joined by Manish Parashar from Rutgers University for a day-long workshop that will address growing data sizes and the infrastructures required to work with them. The team says that in addition to covering the general hardware (and cloud) side, the emphasis of this workshop will be on the middleware infrastructure that facilitates efficient data analytics on big data.
The workshop will likely draw researchers, developers, and practitioners from academia, government, and industry to discuss new and emerging trends in high end computing platforms, programming models, middleware and software services, and outline the data mining and knowledge discovery approaches that can efficiently exploit this modern computing infrastructure. For those who take a large-scale analytics slant to the enterprise side, there are sure to be many lessons offered from this day-long event.
#3 Accelerating MapReduce with GPU & CPU
Heterogeneous architectures that integrate the CPU and the GPU on the same chip are emerging, and hold much promise not only for supporting power-efficient and scalable high performance computing but also for data-intensive scientific and some enterprise application areas.
A team from Ohio State University led by Michael A. Heroux from Sandia National Laboratories argues that MapReduce has emerged as a suitable framework for simplified parallel application development for many classes of applications, including data mining and machine learning applications, all of which can benefit from accelerators.
The team will present the fruits of their research, which addresses the challenge of scaling a MapReduce application using the CPU and GPU together in an integrated architecture. They will describe two methods for dividing the work: a map-dividing scheme, which splits map tasks across both devices, and a pipelining scheme, which runs the map and reduce stages on different devices. They will also explain how they developed dynamic work distribution schemes for both approaches and how they achieved high performance using a runtime tuning method to adjust task block sizes.
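To make the map-dividing idea a little more concrete, here is a minimal sketch in Python. It is not the team's implementation: two ordinary threads stand in for the CPU and the GPU, the device names and block sizes are made up for illustration, and the "dynamic work distribution" is simply each worker pulling its next block of map tasks from a shared queue as soon as it is free.

```python
import queue
import threading


def run_map_dividing(records, map_fn, block_sizes):
    """Split map tasks between workers that pull blocks from a shared queue.

    block_sizes maps a (hypothetical) device name to the number of records
    that device takes per pull. A faster device would get a larger block.
    Returns the mapped values (in no particular order) and a per-device
    count of how many records each worker processed.
    """
    tasks = queue.Queue()
    for record in records:
        tasks.put(record)

    results = []
    processed = {name: 0 for name in block_sizes}
    lock = threading.Lock()

    def worker(name, block_size):
        while True:
            # Grab up to block_size tasks; stop early if the queue runs dry.
            block = []
            try:
                for _ in range(block_size):
                    block.append(tasks.get_nowait())
            except queue.Empty:
                pass
            if not block:
                return
            mapped = [map_fn(item) for item in block]
            with lock:
                results.extend(mapped)
                processed[name] += len(block)

    threads = [
        threading.Thread(target=worker, args=(name, size))
        for name, size in block_sizes.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, processed


if __name__ == "__main__":
    # Assumed block sizes: the "gpu" worker takes bigger blocks than the "cpu" one.
    mapped, counts = run_map_dividing(range(1000), lambda x: x * x, {"cpu": 8, "gpu": 64})
    print("reduce (sum of squares):", sum(mapped))
    print("records handled per device:", counts)
```

Tuning the block sizes at runtime, as the abstract describes, would amount to adjusting the values in `block_sizes` while the job runs; this sketch keeps them fixed.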
#4 Maximizing Performance at Internet Scale
Internet services directly impact billions of users and have a much larger market size than traditional high-performance computing. However, according to Jack Dongarra from the University of Tennessee at Knoxville and Zhiwei Xu from the Chinese Academy of Sciences, these two fields share common technical challenges.
The duo will argue that exploiting locality and providing efficient communication are common research issues. Internet services increasingly feature big data computing, involving petabytes of data and billions of records. This presentation will focus on three problems that recur frequently in Internet services systems: the data placement, data indexing, and data communication problems, which are essential in enhancing performance and reducing energy consumption.
After presenting a formulation of the relationship between performance and power consumption, they will provide examples of high-performance techniques developed to address these problems, including a data placement method that significantly reduces storage space needs, a data indexing method that enhances throughput by orders of magnitude, and a key-value data communication model with its high-performance library. Application examples include Facebook in the USA and Taobao in China.
#5 Programming Techniques for Big Data on Hadoop
Gilad Shainer of Mellanox and the HPC Advisory Council will present in tandem with Eyal Gutkind (also of Mellanox) and Dhabaleswar K. Panda from the Ohio State University on some important parallels between big data and high performance computing in terms of programming and performance requirements.
The team will present the case that Hadoop MapReduce and high performance computing share many characteristics such as large data volumes, variety of data types, distributed system architecture, required linear performance growth with scalable deployment and high CPU utilization.
Their presentation will gel around the idea that RDMA-capable programming models enable efficient data transfers between computation nodes.
To highlight this, they will discuss a collaborative effort among several industry and academic partners to port the Hadoop MapReduce framework to RDMA, covering the challenges, the techniques used, and the benchmarking and testing.
#6 Panel on HPC and Big Data I/O Challenges
The requirements for ExaScale and Big Data I/O are driving research, architectural deliberation, technology selection and product evolution throughout HPC and Enterprise/Cloud/Web computing, networking and storage systems.
Analysis of the Top500 interconnect families over the past decade reveals that it has been an era when standard commodity I/O technologies have come to dominate, almost completely.
Speeds have gone from 1 Gigabit to over 50 Gigabits, latencies have decreased 10X to below a microsecond and software has evolved toward a single software stack, OpenFabrics.
Enterprise is now adopting these same capabilities at a rapid rate. From the perspective of the major suppliers of systems and machines to HPC, this panel will discuss the next generation of interconnects and system I/O architectures. It will be moderated by Bill Boas from the InfiniBand Trade Association and will feature Peter Braam of Xyratex, Sorin Fabish from EMC, Ronald Luijten of IBM Zurich Research Laboratory, Duncan Roweth of Cray Inc., and Michael Kagan from Mellanox Technologies.
#7 Open Source Visualization, Served National Lab Style
To many outside of the HPC or supercomputing world, the computer science business at national labs might appear far removed from daily enterprise computing reality. However, with the growth of more (and far more complex, faster-moving) data, it’s more important than ever to watch what’s happening, especially in terms of open source tool development and use since both scientific and big business computing are moving ever-closer together.
A team of six representatives from Lawrence Livermore National Lab, Oak Ridge National Lab and the Swiss National Supercomputing Center will present on open source visualization for massive, complex datasets. For the enterprise big data folk, this might provide an interesting break from Tableau and the enterprise set of viz tools, not to mention one that’s robust enough to tackle the incredible data the scientific viz folks routinely handle.
More specifically, the full-day event (Monday) will introduce you to VisIt, an open source scientific visualization and data analysis application that is used to visualize simulation results on a wide range of platforms, from laptops to many of the world’s top supercomputers. The team will cover VisIt for data exploration, quantitative analysis, comparative analysis, visual debugging, and communication of results. They will also discuss advanced VisIt usage and development, including writing new database readers, writing new operators, and coupling VisIt with simulations executing on remote computers for in-situ visualization. Not for the faint of heart or skill, but sure to offer some food for thought.
#8 Comparing Hadoop Workloads on Three Research Clusters
A team of researchers from Carnegie Mellon and the University of Washington will describe their work, which has focused on analyzing Hadoop workloads from three different research clusters from an application-level perspective.
In setting about this research, the team had two goals: First, they wanted to explore new issues in application patterns and user behavior and second, they hoped this comparative analysis would help them to better understand key performance challenges related to I/O.
The team’s analysis suggests that Hadoop usage is still in its adolescence. For instance, they will describe the underuse of Hadoop features, extensions, and tools as well as significant opportunities for optimization. They also claim to have seen significant diversity in application styles, including some “interactive” workloads, motivating new tools in the ecosystem.
Overall, the researchers concluded that some conventional approaches to improving performance are not especially effective and suggest some alternatives, but they do see significant opportunity for simplifying the use and optimization of Hadoop.
#9 Networking Big Data at Fermilab
For those interested in networking for big data within large-scale enterprise contexts, this is sure to be a zinger of a session since it deals with one of the largest big data challenge sites in the United States.
A team from Fermi National Lab will make the argument that exascale science translates to big data. In this case, specifically with the Large Hadron Collider (LHC), the lab’s data is not only immense, it is also globally distributed.
Fermilab is host to the LHC Compact Muon Solenoid (CMS) experiment’s US Tier-1 Center, the largest of the LHC Tier-1s. The laboratory must deal with both scaling and wide-area distribution challenges in processing its CMS data. Fortunately, evolving technologies in the form of 100 Gigabit Ethernet, multi-core architectures, and GPU processing provide tools to help meet these challenges.
The team will describe current Fermilab R&D efforts in these areas including the optimization of network I/O handling in multi-core systems, modification of middleware to improve application performance in 100GE network environments, and network path reconfiguration and analysis for effective use of high bandwidth networks.
#10 Addressing Big Data with a Hybrid Core Approach
The spread of digital technology into every facet of modern life has led to a corresponding explosion in the amount of data that is stored and processed.
The need to understand the relationships between elements of data has driven high performance computing beyond numerically intensive computing into the data-intensive algorithms used in fraud detection, national security, and bioinformatics.
In this session, Kirby Collins from Convey Computer will present the latest innovations in Hybrid-Core computing, and describe how high bandwidth, highly parallel reconfigurable architectures can address these and other applications with higher performance, lower energy consumption, and lower overall cost of ownership compared to conventional architectures.
#11 Big Graph Analytics on Hadoop
It’s easy to overlook poster and research presentations during a busy show, but if you get a chance, stop by and see Torsten Hoefler from ETH Zurich, and research authors Joshua Schultz, Enyue Lu (both from Salisbury University) and Jonathan Vierya from California State Polytechnic Pomona so they can tell you about analyzing patterns in large-scale graphs by way of MapReduce.
The team makes a case for analyzing patterns in large-scale graphs, such as social networks (e.g. Facebook, LinkedIn, Twitter), arguing that this has many applications including community identification, blog analysis, and intrusion and spam detection. This is important because it is thus far impossible to process information in large-scale graphs with millions or even billions of edges on a single computer.
The researchers will tell you about how they use MapReduce to detect important graph patterns using open source Hadoop on Amazon EC2, and in doing so show how graph pattern detection with MapReduce in the cloud scales on real-world data. They have an interesting use case to back this up. A definite meet and greet opportunity with possible practical business value.
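For readers who have not seen graph pattern detection phrased as MapReduce, the sketch below finds triangles (three mutually connected nodes) in two map/reduce rounds. Triangles are just one common example of a graph pattern; the poster abstract does not say which patterns the authors target, so treat this as an illustration rather than their method. The `map_reduce` helper is a tiny in-memory stand-in for Hadoop, and all function names here are made up.

```python
from collections import defaultdict
from itertools import combinations


def map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce round (map, shuffle, reduce)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def find_triangles(edges):
    """Return each triangle once as a sorted (a, b, c) tuple of integer vertex ids."""
    # Store every undirected edge once as (low, high) and drop self-loops.
    edges = sorted({(min(u, v), max(u, v)) for u, v in edges if u != v})

    # Round 1: group edges by their lower endpoint and emit "open wedges",
    # i.e. pairs of neighbors (a, b) that would form a triangle with the
    # apex vertex if the edge (a, b) also exists.
    def map_edge(edge):
        low, high = edge
        yield low, high

    def reduce_wedges(apex, neighbors):
        for a, b in combinations(sorted(neighbors), 2):
            yield ((a, b), apex)

    wedges = map_reduce(edges, map_edge, reduce_wedges)

    # Round 2: join wedges against the real edges on the key (a, b).
    records = [("wedge", w) for w in wedges] + [("edge", e) for e in edges]

    def map_join(record):
        kind, payload = record
        if kind == "wedge":
            key, apex = payload
            yield key, apex
        else:
            yield payload, None  # None marks "this edge really exists"

    def reduce_join(key, values):
        if None in values:
            for apex in values:
                if apex is not None:
                    yield (apex, key[0], key[1])

    return sorted(map_reduce(records, map_join, reduce_join))


if __name__ == "__main__":
    sample = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
    print(find_triangles(sample))
```

Because each wedge is generated only at the lowest-numbered vertex of a would-be triangle, every triangle is reported exactly once. On a real cluster the two rounds would be two chained Hadoop jobs.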
#12 Overview of Graph Analytics in Big Data
Big data has grown enormously in importance over the past 5 years. However, most data intensive computing is focused on conventional analytics: searching, aggregating and summarizing the data set.
As Amar Shan from YarcData and Shoaib Mufti from Cray will address in their session, graph analytics goes beyond conventional analytics to search for patterns of relationships, a capability that has important application in many HPC and enterprise areas ranging from climate science to healthcare and life sciences to intelligence.
The purpose of this session is to bring together practitioners of graph analytics. Presentations and discussions will include system architectures and software designed specifically for graph analytics; applications; and benchmarking.
#13 Storage Vaccinations for the Plague of Petabytes
In HPC and big data alike, it is stunningly easy for users to create and store data, but as Matthew Drahzal from IBM will argue in his presentation, these same users are completely unaware of the challenges and costs of all this spinning disk.
Drahzal says that many treat all created data as equally important and critical to retain, whether or not this is true. Since the growth rate of stored data is higher than the areal-density growth rate of spinning disks, organizations are purchasing more disk and spending more of their IT budget on managing data. While the cost of computation is decreasing, the cost to store, move, and manage the resultant information is ever-expanding.
IBM Research and Development, which is where the speaker hails from, is working on new technologies to shift data cost curves fundamentally lower, use automation to manage data expansion, and leverage diverse storage technologies to manage efficiencies, all “behind the scenes” and nearly invisible to end users. This presentation will describe the new data technologies being developed and perfected, and how these changes may fundamentally reset data costs lower.
#14 Visualization and Analysis of Massive Data Sets
Traditionally, filtering of large numerical simulations stored in scientific databases has been impractical owing to the immense data requirements. Rather, filtering is done during simulation or by loading snapshots into the aggregate memory of an HPC cluster.
A team of researchers from Johns Hopkins University will describe a new query processing framework for the efficient evaluation of spatial filters on large numerical simulation datasets stored in a data-intensive cluster. The team will describe its system, which performs filtering within the database and supports large filter widths.
They will present two complementary methods of execution. I/O streaming computes a batch filter query in a single sequential pass using incremental evaluation of decomposable kernels; summed volumes generates an intermediate dataset and evaluates each filtered value by accessing only eight points in this dataset. The system chooses dynamically between these methods depending on workload characteristics. It allows the researchers to run filters against large datasets with little overhead: query performance scales with the cluster’s aggregate I/O throughput.
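The "only eight points" detail is easier to see in code. The sketch below builds a 3D summed-volume table (the 3D cousin of a summed-area table) and then computes the sum over any axis-aligned box from its eight corner entries by inclusion-exclusion. This is a generic, in-memory illustration in plain Python; the function names are made up, and the Johns Hopkins system does this inside a database on far larger data.

```python
def build_summed_volume(grid):
    """Build table S where S[i][j][k] is the sum of grid[x][y][z] for x < i, y < j, z < k.

    grid is a nested list indexed as grid[x][y][z]. S has one extra layer of
    zeros along each axis so that box queries need no boundary checks.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    S = [[[0] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            for k in range(1, nz + 1):
                S[i][j][k] = (
                    grid[i - 1][j - 1][k - 1]
                    + S[i - 1][j][k] + S[i][j - 1][k] + S[i][j][k - 1]
                    - S[i - 1][j - 1][k] - S[i - 1][j][k - 1] - S[i][j - 1][k - 1]
                    + S[i - 1][j - 1][k - 1]
                )
    return S


def box_sum(S, lo, hi):
    """Sum of the half-open box [x0, x1) x [y0, y1) x [z0, z1) using 8 lookups."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    return (
        S[x1][y1][z1]
        - S[x0][y1][z1] - S[x1][y0][z1] - S[x1][y1][z0]
        + S[x0][y0][z1] + S[x0][y1][z0] + S[x1][y0][z0]
        - S[x0][y0][z0]
    )


def box_mean(S, lo, hi):
    """Box (mean) filter value for the same half-open box."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    volume = (x1 - x0) * (y1 - y0) * (z1 - z0)
    return box_sum(S, lo, hi) / volume
```

The cost of a query is the same eight lookups whether the box covers ten cells or ten million, which is why the approach suits large filter widths. The price is the extra pass (and storage) needed to build the intermediate table.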
#15 Data Analysis through Computation and 3D Stereo Visualization
Technological innovations have begun to produce larger and more complex data than can be analyzed through traditional methods.
Jason Haraldsen and Alexander Balatsky from Los Alamos National Laboratory will address this in their presentation, which will discuss the advancement of data analysis through computation and 3D active stereo visualization.
The duo plans to demonstrate the combination of computation and 3D stereo visualization for the analysis of large complex data sets and will present specific examples of theoretical molecular dynamics, density functional, and inelastic neutron scattering simulations as well as experimental data of scanning tunneling microscopy and atom probe tomography. They also plan to present an open discussion of visualization and the new frontier of data analysis.
#16 Graph Exploration with a Distributed In-Memory Approach
A large team led by Umit Catalyurek from the Ohio State University (and manned by researchers from the IBM T.J. Watson Research Center and Indiana University) will present a deep dive on graph explorations with a distributed in-memory approach on Tuesday morning.
The team behind the research will describe the challenges involved in designing a family of highly-efficient Breadth-First Search (BFS) algorithms and in optimizing these algorithms on the latest two generations of Blue Gene machines, Blue Gene/P and Blue Gene/Q.
On Blue Gene/P, the researchers were able to parallelize the largest BFS search, running a scale 38 problem with 2^38 vertices and 2^42 edges on 131,072 processing cores. Using only four racks of an experimental configuration of Blue Gene/Q, they also achieved the fastest processing rate reported to date on a BFS search, 254 billion edges per second on 65,536 processing cores. Perhaps most valuably, they will also describe the algorithmic design and the main classes of optimizations that were used to achieve these results.
It should be noted that this is the same team that had stellar results with their Graph 500 submissions in November 2010, June 2011, and November 2011, presenting some impressive scalability results in both space and size. Worth it for the truly performance-conscious, assuming, of course, you’re an IBM shop.
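If BFS and "scale 38" are unfamiliar, the following sketch may help. It shows a plain single-machine, level-synchronous breadth-first search and the arithmetic behind the problem size. It is nothing like the tuned Blue Gene code, and the edge factor of 16 is the Graph 500 default, assumed here because it is consistent with the vertex and edge counts quoted above.

```python
def bfs_levels(adjacency, source):
    """Level-synchronous BFS over an adjacency dict {vertex: [neighbors]}.

    Returns each reachable vertex's distance from source and the number of
    adjacency entries scanned. The scan count is a rough stand-in for
    "traversed edges"; the official Graph 500 metric is defined differently.
    """
    level = {source: 0}
    frontier = [source]
    edges_scanned = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adjacency.get(u, ()):
                edges_scanned += 1
                if v not in level:
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier  # move on to the next BFS level
    return level, edges_scanned


def graph500_problem_size(scale, edge_factor=16):
    """Vertex and edge counts for a Graph 500 style problem of a given scale.

    edge_factor=16 is the benchmark's default and is an assumption here.
    """
    vertices = 2 ** scale
    return vertices, vertices * edge_factor


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    print(bfs_levels(graph, 0))
    print(graph500_problem_size(38))
```

In a distributed version, each level's frontier is split across nodes and newly discovered vertices have to be exchanged over the network between levels, which is where most of the engineering effort in results like these tends to go.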
#17 Another Deep Dive on Breadth-First Search
Another session that covers breadth-first search at massive scale will be led by Umit Catalyurek, but will be supported by a team of researchers from Intel.
This team will consider graph traversal in a different light, noting that it goes far beyond the world of supercomputing benchmarks that test data-intensive computing performance. Graph traversal is used in many fields including social networks, bioinformatics and HPC. The push for HPC machines to be rated in “GigaTEPS” (billions of traversed edges per second) has led to the Graph500 benchmark, which we will discuss in more detail this week when those announcements are made.
Intel comes into the picture when the team notes that graph traversal is well-optimized for single-node CPUs. However, current cluster implementations suffer from high-latency and large-volume inter-node communication, with low performance and energy efficiency. To address this, the team will describe how they used novel low-overhead data-compression techniques to reduce communication volumes along with new latency-hiding techniques. Keeping the same optimized single-node algorithm, they were able to obtain a 6.6X performance improvement and order-of-magnitude energy savings over state-of-the-art techniques.
This particular Graph500 implementation achieves 115 GigaTEPS on a 320-node Intel Endeavor cluster with E5-2700 Sandy Bridge nodes, matching the second-ranked result in the November 2011 Graph500 list with 5.6X fewer nodes. Per-node performance drops only 1.8X compared with optimized single-node implementations and is the highest in the top 10 of the list. The team also claims near-linear scaling with node count: on 1,024 Westmere nodes of the NASA Pleiades system, they obtained 195 GigaTEPS.
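The abstract does not say which compression techniques the team used, so the sketch below swaps in a well-known generic one purely to show why compression helps here: sorting the vertex IDs bound for another node, storing only the gaps between them, and packing each gap into a variable number of bytes. The function names are made up, and vertex IDs are assumed to be non-negative integers.

```python
def encode_vertex_ids(ids):
    """Delta + variable-length byte encoding of a set of non-negative vertex ids.

    Ids are de-duplicated and sorted, then each gap to the previous id is
    written 7 bits at a time, with the high bit marking "more bytes follow".
    """
    out = bytearray()
    previous = 0
    for vertex in sorted(set(ids)):
        gap = vertex - previous
        previous = vertex
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_vertex_ids(buffer):
    """Inverse of encode_vertex_ids: returns the sorted list of vertex ids."""
    ids = []
    previous = 0
    gap = 0
    shift = 0
    for byte in buffer:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: keep accumulating this gap
        else:
            previous += gap
            ids.append(previous)
            gap, shift = 0, 0
    return ids


if __name__ == "__main__":
    frontier = [1000000, 1000003, 1000010, 1000500, 5]
    packed = encode_vertex_ids(frontier)
    print(len(packed), "bytes instead of", 8 * len(frontier), "for 64-bit ids")
    print(decode_vertex_ids(packed))
```

Dense frontiers have small gaps between neighboring IDs, so most entries shrink to a byte or two. Whether that pays off in practice depends on how the encode and decode cost compares with the network time saved, which is the "low-overhead" part of the team's claim.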
#18 Big Data, Big Opportunities for Research, Enterprise
With such staggering data growth rates, it is clear there has never been more data available, but also no greater imperative to access, analyze and distribute it more efficiently. Especially in high-performance computing (HPC) environments, data stores can grow extremely rapidly and though compute server technology has kept pace, storage has not, creating a barrier between researchers and their data.
This session, which will be led by Vinod Muralidhar of EMC Isilon, will examine how implementing scale-out storage can eliminate the storage bottleneck in HPC and put data immediately into the hands of those who need it most.
While this is a vendor-focused presentation, the overall description of the challenges and solutions that might be possible for some who are considering scale-out storage could make this very much worth the time.
#19 TCP and Big Throughput for Big Data
Saturating high capacity and high latency paths is a challenge with vanilla TCP implementations, which is primarily due to congestion-control algorithms which adapt window sizes when acknowledgements are received.
With large latencies, the congestion-control algorithms have to wait longer to respond to network conditions (e.g., congestion), which results in lower aggregate throughput.
A team from Virginia Tech led by Torsten Hoefler from ETH Zurich will present their case that throughput can be improved if one can reduce the impact of large end-to-end latencies by introducing layer-4 relays along the path. Such relays would enable a cascade of TCP connections, each with lower latency, resulting in better aggregate throughput.
In their presentation on Cascaded TCP, the team argues that this would directly benefit typical applications as well as big data applications in distributed HPC, and they plan to reveal the empirical results supporting their hypothesis.
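A back-of-the-envelope model shows why shorter hops could help. The sketch below uses the widely cited Mathis approximation for steady-state TCP throughput, in which throughput is proportional to segment size divided by round-trip time times the square root of the loss rate. This is a textbook simplification chosen for illustration, not the team's model, and the assumptions that loss splits evenly across hops and that relays add no overhead are mine.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    """Approximate steady-state TCP throughput in bits per second.

    Mathis approximation: rate ~ (MSS / RTT) * sqrt(3 / (2 * p)).
    """
    return (mss_bytes * 8 / rtt_seconds) * math.sqrt(1.5 / loss_rate)


def cascaded_throughput_bps(mss_bytes, segments):
    """Throughput of a chain of TCP connections joined by layer-4 relays.

    segments is a list of (rtt_seconds, loss_rate) pairs, one per hop.
    The chain can go no faster than its slowest hop. Relay processing and
    buffering costs are ignored (an assumption of this sketch).
    """
    return min(mathis_throughput_bps(mss_bytes, rtt, loss) for rtt, loss in segments)


if __name__ == "__main__":
    mss = 1460  # bytes, a typical Ethernet-sized TCP segment
    direct = mathis_throughput_bps(mss, rtt_seconds=0.100, loss_rate=1e-4)
    # One relay at the midpoint: each hop sees half the RTT and, by
    # assumption, half the loss.
    cascaded = cascaded_throughput_bps(mss, [(0.050, 5e-5), (0.050, 5e-5)])
    print(f"direct:   {direct / 1e6:.1f} Mbit/s")
    print(f"cascaded: {cascaded / 1e6:.1f} Mbit/s")
```

Under these assumptions, halving the round-trip time alone doubles the modeled rate, and halving the per-hop loss adds another factor of the square root of two. Real paths will not be this tidy, which is presumably why the team's empirical results are the part worth showing up for.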
#20 Cyber Security’s Big Data, Graphs, and Signatures
Cyber security increases in complexity and network connectivity every day. Today’s problems are no longer limited to malware using hash functions.
Interesting problems, such as coordinated cyber events, involve hundreds of millions to billions of nodes and as many or more edges. Nodes and edges go beyond single-attribute objects to become multivariate entities depicting complex relationships with varying degrees of importance.
Daniel M. Best from Pacific Northwest National Laboratory will argue that to unravel cyber security’s big data, novel and efficient algorithms are needed to investigate graphs and signatures. In his session, Best will bring together domain experts from various research communities to talk about current techniques and the grand challenges being researched, with the aim of fostering discussion.