December 7, 2017

Data Prep Goes Serverless


The rise of platforms in which cloud providers manage the allocation of computing and storage resources has opened the door to new data services such as serverless data preparation tools. The list of self-service data preparation tools is growing as vendors offer varying approaches to whipping raw data into shape for analysis.

“These tools are aimed at reducing the time and complexity of preparing data and improving analyst productivity,” Gartner noted in a recent review of self-service data preparation tools. Vendors estimate that data scientists spend about 80 percent of their time preparing data for analysis.

Cloud-based serverless data prep tools appear to be making the most headway among data analysts who want to wrangle their own data sets for analysis rather than rely on standard ETL routines developed to plumb data warehouses.

Among the tools gaining the highest marks in the recent Gartner survey of self-service data prep vendors were Lavastorm and Trifacta. Google recently announced the beta availability of a managed data wrangling service developed in collaboration with Trifacta called Google Cloud Dataprep.

The service is designed to accelerate data preparation for analysis using Google Cloud Platform, the partners said. The data prep tool also leverages a serverless data processing engine, Google Cloud Dataflow, which manages computing resources as needed.

Google extended the Trifacta data prep service by adding support for BigQuery and cloud storage.

In one use case, raw event data from Internet of Things and other devices was dropped into BigQuery, where data descriptors were added. The data was then combined with other data feeds to ease queries using tools such as Looker, the analytical tool vendor specializing in the Google database.

In a blog post, Mark Rittman, product manager for analytics at Qubit, said he used the configuration to set up BigQuery tables to receive data via streaming inserts sent by a server running on a Google Compute Engine virtual machine. Using data from his Fitbit health tracker, he assembled data prepped by the Google tool using its “spreadsheet-like interface.”
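
For readers unfamiliar with streaming inserts, the pattern can be sketched roughly as follows. This is an illustration only, not Rittman's code: the field names (recorded_at, metric, value) and the table identifier are hypothetical, and the client argument is assumed to behave like google.cloud.bigquery.Client, whose insert_rows_json method returns a list of per-row errors.

```python
from datetime import timezone


def build_rows(readings):
    """Turn raw (timestamp, metric, value) tuples into JSON-ready rows.

    Timestamps are expected to be timezone-aware datetimes. The field
    names used here are hypothetical, chosen only for illustration.
    """
    rows = []
    for ts, metric, value in readings:
        rows.append({
            "recorded_at": ts.astimezone(timezone.utc).isoformat(),
            "metric": metric.strip().lower(),  # normalize the label
            "value": float(value),             # coerce strings to numbers
        })
    return rows


def stream_rows(client, table_id, rows):
    """Send rows through the client's streaming-insert call.

    `client` is assumed to expose insert_rows_json(table, json_rows),
    as google.cloud.bigquery.Client does; an empty error list means
    every row was accepted.
    """
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"streaming insert rejected rows: {errors}")
    return len(rows)
```

On a server such as the Compute Engine virtual machine described above, the client would typically be created once and stream_rows called as new readings arrive. Cleaning and reshaping the rows after they land is the job left to the data prep tool.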

What’s missing, Rittman noted, was support for cloud APIs such as Google (NASDAQ: GOOGL) natural language processing. He expects these and other upgrades to be added as Google extends the Trifacta code base to leverage more serverless analytics features.

The embrace of serverless data prep tools underscores the steady enterprise shift of big data analytics away from on-premise Hadoop deployments to the public cloud. Gartner (NYSE: IT) estimates global public cloud services will grow 18 percent this year to $247 billion, and that cloud services will account for the majority of analytics purchases by 2020.

In a community survey released this week, the Cloud Native Computing Foundation reported that 70 percent of members are using the Lambda serverless platform from Amazon Web Services (NASDAQ: AMZN), while Google Cloud Functions, Microsoft (NASDAQ: MSFT) Azure Functions and Apache OpenWhisk are also gaining traction.

Recent items:

Cloud In, Hadoop Out as Hot Repository for Big Data

Looker Rolls New Google BigQuery Tools