September 17, 2012

Big Data – Scale Up or Scale Out or Both


Introduction

The term “Big Data” is generally used to describe datasets that are too large or complex to be analyzed with standard database management systems. The point at which a dataset counts as “Big Data” is a moving target, since the amount of data created each year grows, as do the capabilities of the software and hardware (speed and capacity) used to make sense of the information. Many use the terms volume (amount of data), velocity (speed of data in and out) and variety (range of data types) to describe “Big Data”. Large datasets can be analyzed and interpreted in two ways:

  • Distributed Processing – use many separate (thin) computers, each of which analyzes a portion of the data. This method is sometimes called scale-out or horizontal scaling.
  • Shared Memory Processing – use large systems with enough resources to analyze huge amounts of data. This method is sometimes called scale-up or vertical scaling.

Distributed processing

Depending on the type of data and the desired outcome, a distributed system may be the best fit for an organization. Simple searches through a list of records can easily be distributed across a set of systems, with the results from each server then collected. An example would be searching through all driver license records for those with blue eyes. A portion of the driver license records can be kept on each server, in memory, without issues such as overlap or dependency on results from other servers. Apache Hadoop is an open source software framework that supports data-intensive distributed applications. It enables applications to work with hundreds to thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
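The driver license search described above can be sketched as a scatter-and-gather pattern. This is a minimal illustration only, not Hadoop code: the records, the function names and the use of threads as stand-ins for separate servers are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (license_id, eye_color). Values invented for illustration.
RECORDS = [
    (1, "blue"), (2, "brown"), (3, "green"), (4, "blue"),
    (5, "hazel"), (6, "blue"), (7, "brown"), (8, "green"),
]


def partition(records, n_shards):
    """Split the records into n_shards disjoint shards (round-robin)."""
    return [records[i::n_shards] for i in range(n_shards)]


def search_shard(shard, eye_color):
    """Work done independently by one 'server': filter only its own shard."""
    return [lic_id for lic_id, color in shard if color == eye_color]


def distributed_search(records, eye_color, n_shards=3):
    """Scatter the search across shards, then gather the partial results."""
    shards = partition(records, n_shards)
    # Threads stand in for separate servers in this sketch.
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = list(pool.map(lambda s: search_shard(s, eye_color), shards))
    # Gather step: merge the per-shard results into one sorted list.
    return sorted(lic_id for part in partials for lic_id in part)


print(distributed_search(RECORDS, "blue"))  # [1, 4, 6]
```

Because no shard needs anything from another shard, adding a server here only means raising `n_shards`; the search logic itself does not change.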

Advantages and disadvantages of distributed processing

The main advantage of distributed processing is its ability to scale just by adding “one more node”. An additional node or server can be added to the cluster, and the necessary modifications to scripts or applications can be implemented quickly. On the other hand, it requires the skill sets and management capabilities to run a Hadoop cluster, which involves setting up the software on multiple systems and keeping it tuned and running. It is also worth noting that Hadoop is suitable for cases where data interdependency is low and little (if any) data replication is required.

Shared-memory processing

If the data is complex or unstructured, or if multiple algorithms must be applied to it, a large shared-memory system would be the best fit. Much more of the data can be held in the memory of the system, and different processes can all operate on the same data while it resides in memory. For instance, monitoring thousands of video feeds to determine any correlation between the images would benefit from keeping all the feeds in main memory and having multiple applications all work with the data. With a shared-memory approach, applications become easier to develop as well as debug.
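The idea of several analyses working on the same in-memory data can be sketched as follows. This is a simplified illustration under stated assumptions: the feed names, the brightness samples and both analysis functions are hypothetical, and threads inside one Python process stand in for applications on a shared-memory machine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory dataset: brightness samples per video feed (invented values).
FEEDS = {
    "cam_a": [10, 20, 30],
    "cam_b": [5, 5, 5],
    "cam_c": [0, 50, 100],
}


def mean_brightness(feeds):
    """First analysis: average brightness of each feed."""
    return {name: sum(v) / len(v) for name, v in feeds.items()}


def peak_brightness(feeds):
    """Second analysis: maximum brightness of each feed."""
    return {name: max(v) for name, v in feeds.items()}


def run_analyses(feeds):
    """Run both analyses concurrently against the same object in memory."""
    # Each worker receives a reference to the same dict; the data is
    # neither copied nor partitioned between the two analyses.
    with ThreadPoolExecutor(max_workers=2) as pool:
        mean_future = pool.submit(mean_brightness, feeds)
        peak_future = pool.submit(peak_brightness, feeds)
        return mean_future.result(), peak_future.result()


means, peaks = run_analyses(FEEDS)
print(means)  # {'cam_a': 20.0, 'cam_b': 5.0, 'cam_c': 50.0}
print(peaks)  # {'cam_a': 30, 'cam_b': 5, 'cam_c': 100}
```

Neither analysis has to know how the data is laid out across machines, which is one reason shared-memory code tends to be simpler to write and debug.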

Advantages and disadvantages of shared-memory processing

While it is much easier to manage a single large-scale system and host all the data and processing on one machine, such systems tend to be quite expensive. The reduction in OPEX that results from having a single system to manage, along with the reduction in DBA complexity, comes at the cost of the hardware.

ScaleMP vSMP Foundation

vSMP Foundation from ScaleMP creates a virtual shared-memory system from a distributed infrastructure, providing the best of both worlds for big-data and analytics problems. On one hand, it allows scaling just by adding “one more node”, yet it keeps the OPEX advantage of a shared-memory system. It provides benefits for small Hadoop deployments where the OPEX costs are high, and it can handle big-data cases where data cannot be easily distributed, by providing a shared-memory processing environment.

How does it work?

vSMP Foundation creates a single virtual machine with CPUs, RAM and I/O aggregated from several smaller systems. This allows for more data to be held in memory, directly accessible by any of the CPUs in the aggregated system.
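As a rough, purely arithmetic illustration of aggregation, the sketch below sums the cores and memory of several small nodes into the totals a single system image would present. The node sizes, the `Node` type and the `aggregate` function are hypothetical; the sketch does not model any memory or CPU overhead the aggregation layer itself might consume, and it does not reflect how vSMP Foundation is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Resources of one system (hypothetical sizes, for illustration only)."""
    cores: int
    ram_gb: int


def aggregate(nodes):
    """Naive totals across nodes; no aggregation overhead is modeled."""
    return Node(
        cores=sum(n.cores for n in nodes),
        ram_gb=sum(n.ram_gb for n in nodes),
    )


# Four assumed nodes with 16 cores and 128 GB of RAM each.
cluster = [Node(cores=16, ram_gb=128)] * 4
vm = aggregate(cluster)
print(vm)  # Node(cores=64, ram_gb=512)
```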

[Figure: Virtual Machine]

Advantages and disadvantages of vSMP Foundation

With complex datasets that require a number of processing steps and significant computation, the ability to hold larger datasets in memory, without having to swap to disk, greatly reduces the time needed to understand trends within the data.

Smaller, more distributed types of analyses can also benefit from vSMP Foundation. For example, if a distributed algorithm is used, a certain number of servers must be maintained and administered. Using vSMP Foundation on a small to medium-sized cluster greatly reduces the administration cost, while the application runs at the same performance. By reducing the administration costs associated with a small cluster running a distributed algorithm, an increased ROI can be achieved while continuing to run familiar big data analysis applications.

Usage Models for Aggregated Systems

When systems are aggregated, a number of use cases benefit anyone analyzing large amounts of data. The most popular case is gaining access to all of the memory across all of the systems. Data can be stored in memory, significantly speeding up data access (as compared to hard disk drives). This is shown in the upper left of the diagram. Some applications are more compute intensive and can benefit from using a large number of cores for processing. In this case, a single application running on a single instance of the OS can use all of the cores as well as access all of the memory. This is shown in the lower left of the diagram. Other applications that access large amounts of data can benefit from aggregating the I/O capabilities of individual servers into one I/O system. This allows for faster ingest of the data, as shown in the upper right of the diagram. Finally, the lower right shows how aggregation can simplify the management of a cluster, as compared to having to manage individual systems.

[Figure: Virtual SMP]

Summary

With the increasing business requirements to analyze and understand significant volumes of data, it is important to create an IT system that can respond to those needs. By using vSMP Foundation to combine the low cost of scale-out systems with the advantages of scale-up systems, cost savings can be realized while maintaining a highly responsive system to make sense of Big Data information.

 

 

Distributed Processing

  • Advantages: Low-cost infrastructure (CAPEX) with pay-as-you-grow characteristics
  • Disadvantages: Management cost (OPEX)

Shared Memory Processing

  • Advantages: Single system to manage (OPEX)
  • Disadvantages: Platform cost (CAPEX)

vSMP Foundation

  • Advantages: Low-cost infrastructure (CAPEX) with pay-as-you-grow characteristics; single system to manage (OPEX)
  • Disadvantages: (none listed)

 

For More Information:

To learn more about how aggregating scale-out servers can benefit and speed up big data analysis, download our whitepapers or visit the ScaleMP web site.
