May 29, 2012

Open Source Testbed Targets Big Data Development

Nicole Hemsoth

Big data developers looking to put their concepts to the test could have a new friend in EMC and Greenplum, which have put forth a limited, free testbed cluster that features 24 petabytes of physical storage across its 1,000 hardware nodes (boosted to 10,000 nodes with the virtual nodes added).

Greenplum says that the new service, dubbed the Greenplum Analytics Workbench, will not only be used to test the limits of scale-out infrastructure technologies, but to redefine current models of big data analytics running on the Hadoop framework.

As it stands, the out-of-the-box incarnation of Hadoop still leaves a lot to be desired. Enterprise users, while eager to tap into Hadoop to manage ever-growing workloads, remain hesitant due to a lack of validated code and well-tested tools running on the open source framework.

According to Scott Yara, Greenplum’s Vice President, the Analytics Workbench addresses a need for developers to have a platform to test potential big data analytics code. On the other side, Greenplum points to the well-known reality that Hadoop innovation requires a steady stream of solid contributions made by open source developers. However, the Apache Hadoop community has consistently faced the challenge of provisioning the required resources to validate new releases of the open source software.

As the company pointed out, “Without access to a large cluster for scale validation, the Apache community – and enterprise users – must wait for Hadoop user communities to sponsor an effort to run scale validations. This is done very infrequently and a lot of time is spent stabilizing releases for enterprise adoption.”

Yara says that the workbench blends troves of both structured and unstructured data gathered from a wealth of data-heavy sources, including social media, sensors, and call centers. As he noted this week, “developers can test real-time big data analytics using the data,” which will drive further big data software innovation.

The Greenplum Analytics Workbench is the work of combined efforts from a number of technology companies, including EMC (Greenplum’s parent company), Intel, Mellanox, SuperMicro, and VMware. Yara said that there are a handful of collaborating institutions behind the project as well, including MIT and Stanford.

The Analytics Workbench is being offered for free at the moment, but not to just anyone: potential users need to have their use of the platform approved in advance. Further, Greenplum is working closely with Apache on this project, which means that results will be made available to the open source community.

According to Apache and Greenplum, the Workbench will “enable the Apache Hadoop open source community to validate code to scale on a regular, ongoing basis.” They say that with the contributions certified at scale, enterprises can take this code and run it with some degree of confidence.
