November 10, 2017

Data Catalogs Scale in the Cloud


Data cataloging software for Hadoop and other big data systems emerged as a hot item at last year’s Strata + Hadoop World Expo.

Among the proponents of data cataloging, which is designed to help classify and organize most everything thrown into data lakes, is Waterline Data. The startup based in Mountain View, Calif., is releasing the latest version of its “smart data catalog” that has been upgraded with “multi-petabyte” scaling along with native cloud and on-premise support.

Data catalogs are catching on as enterprises struggle to extract mountains of data that are easy to deposit in Hadoop and other platforms but tough to track down later. Hence, data catalogs are being touted as something more than simply a data management tool within widening data lakes.

Waterline initially addressed data cataloging using “tags” to track the lineage of every piece of data. The latest version (4.03), to be released Nov. 13, incorporates machine learning along with scaling and native platform support.

Those upgrades are designed to accelerate classification and organization of data assets and their lineage, the company claims. Those improvements are intended to automatically profile and tag billions of rows of data. The startup also claims its approach reduces data processing time by as much as a factor of ten.

Along with on-premise support, the latest release of the data catalog extends native support to Microsoft Azure (NASDAQ: MSFT) and adds a preview edition on Google Cloud Platform (NASDAQ: GOOGL), in addition to previous support for Amazon Web Services’ (NASDAQ: AMZN) Simple Storage Service and Elastic MapReduce.

Support also has been added for MapR 5.2 along with earlier support for Cloudera and Hortonworks (NASDAQ: HDP).

Alex Gorelik, CEO and founder of Waterline Data, told us last year that open source tools like Apache Atlas, which is backed by Hortonworks, and Cloudera Navigator provide a good technical foundation for addressing data cataloging and master data management (MDM) challenges. However, Gorelik added that they don’t go far enough to solve the problem, which is why Waterline uses “tags” to track the lineage of every piece of data.

Other up-and-coming MDM vendors include New York-based Collibra, which also receives high marks for its data governance tools, and Alation, which along with data cataloging also tracks queries that run in parallel with data as it’s collected.

Meanwhile, Waterline asserts its upgrade to “universal native support” is a “big deal” since it allows the platform to scale along with the growing volume and diversity of data. It also emphasizes the enterprise shift to hybrid and multi-cloud deployments.

Recent items:

Data Catalogs Emerge as a Strategic Requirement for Data Lakes

8 Tips for Achieving ROI in Your Data Lake