Pivotal Opens Up HAWQ, MADlib
Pivotal Software has turned over its SQL-on-Hadoop engine along with its MADlib machine-learning tool to the open source community as it seeks to extend the reach of its interactive SQL engine deeper into the Hadoop ecosystem.
Pivotal, which announced in April it was teaming with Hortonworks by combining its big data suite with its partner’s Hadoop platform, said recently it is contributing its HAWQ engine and MADlib framework to the Apache Software Foundation, giving each “incubation” status within the open source group.
As hyperscalers like Netflix build data infrastructure that is tied directly to applications, Pivotal said its contribution of HAWQ and MADlib provides a proven SQL engine that would fill in missing building blocks in the Hadoop ecosystem.
HAWQ incorporates the SQL processor and relational query engine of Pivotal’s original Greenplum database. “Greenplum on Hadoop has evolved significantly to a system recast in terms of Hadoop,” San Francisco-based Pivotal noted in a blog post announcing the open source contributions.
MADlib emerged from a collaboration between researchers at the University of California at Berkeley, the University of Wisconsin and the University of Florida, along with engineers and computer scientists at Pivotal. Designed for in-database analytics, MADlib leverages the massively parallel-processing capabilities of the Greenplum database and HAWQ.
The open source contribution represents the “first big step toward building not only a Hadoop Native SQL engine, but ultimately an entire Hadoop Native, data center-class, high performance analytic database infrastructure,” Pivotal asserted.
It also cited the transformation of the database industry driven in part by the rapid rise of mobile and Internet of Things workloads along with the meshing of data with continuous delivery of applications. Those factors have combined to make Hadoop “the fundamental substrate of new generation data warehousing,” the company said.
The Hadoop partnership with Hortonworks announced earlier this year is designed in part to move HAWQ away from a proprietary management and configuration framework to an open source, Hadoop-native environment. Pivotal claimed that would reduce the total cost of ownership in managing the Hadoop stack, including Pivotal HAWQ.
Meanwhile, MADlib is positioned as an open source library for scalable in-database analytics and is designed to provide “data-parallel implementations” of mathematical, statistical and machine learning methods for structured and unstructured data. The framework uses shared-nothing, distributed, scale-out architectures to offer a toolset for analytics problems involving very large data sets. MADlib is SQL-based and supports PostgreSQL as well as Apache HAWQ and Pivotal Greenplum databases.
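To make the “data-parallel” idea concrete, here is a minimal Python sketch of how a simple linear regression could be fit across a shared-nothing layout: each partition computes a small summary of its own rows, the summaries are merged, and the coefficients are derived once at the end. This is an illustration only, not MADlib’s code or API; the function names (transition, merge, final) and the sample partitions are assumptions made for the example.

```python
from functools import reduce

def transition(rows):
    """Per-partition pass: accumulate sufficient statistics for y = a + b*x."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Combine two partial states; order does not matter because sums commute."""
    return tuple(p + q for p, q in zip(a, b))

def final(state):
    """Turn the merged statistics into (intercept, slope) via least squares."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical data, split across three "segments"; every row satisfies y = 1 + 2x.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

# Each partition is summarized independently, then the summaries are merged.
intercept, slope = final(reduce(merge, map(transition, partitions)))
print(intercept, slope)
```

Because only the five running sums leave each partition, the amount of data exchanged between nodes stays constant no matter how many rows each one holds, which is the property that lets this style of computation scale out.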
The library’s SQL APIs are designed to allow it to work with a wide range of data stores and SQL engines while providing a common language on which to build. Pivotal said the tool kit includes algorithms for classification, regression, clustering, topic modeling, association rule mining, descriptive statistics and validation.