November 26, 2018

Self-Service Data Preparation – At Scale or Sampling?

Piet Loubser

The phrase “data is the new oil” has become the favorite business transformation cliché of the past 10 years. The truth is that data in its raw form is about as useful for decision making as crude oil is for propelling a car. Data preparation is key to making data useful, and involves collecting, cleaning, and blending the data and ultimately making it available for the intended use – whether for analytics or another application.

While IT-centric data prep has been around for decades, self-service data prep tools designed specifically for business users and analysts are new. Despite their benefits and new ease-of-use features, not all self-service data prep solutions are equal when it comes to dealing with real-life data types, complexity, and volumes.

Dealing with the Data Explosion

Traditionally, digitized data, whether collected or captured via some application, has predominantly been about collecting “records” of our business and providing useful instrumentation of our business processes. In addition to the growing number of applications, we now have clickstreams and IoT-style data instrumenting the behaviors of machines and people in ways we never had before.

The end result is that we have more data to inform us about our business, customers, and products, but it is spread across a landscape with a growing number of locations and types and exploding volumes, all of which has left businesses struggling to find a way to analyze the data fast enough.

Self-Service Versus Rules-Based Prep

According to The Forrester Wave: Data Preparation Tools, Q1 2017, “The data preparation tools market is growing and evolving because [customer insight], marketing, and operations professionals recognize the urgency of speeding up their ability to use data for actionable insights….[yet] traditional data management overhead gets in the way of analytical agility and fast insights for action with outcomes.”


In the past, IT-centric data prep was done using ETL (extract, transform, load) tools or SQL scripting. Data ingest and transformations were done by programming rules for things such as joins, matching, and handling misspellings in the data. Once the set of rules was programmed, the job would run in batch to produce a resulting data set.
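As a rough illustration of the rules-based approach described above, the sketch below hard-codes two rules (a misspelling fix and a join) and runs them as a batch. Every name, field, and value in it is hypothetical and chosen only for illustration; it is not taken from any particular ETL tool.

```python
# Hypothetical rule set: all names and values are illustrative assumptions.
SPELLING_FIXES = {"calfornia": "California", "new yrok": "New York"}


def clean_state(value):
    """Rule 1: trim whitespace and correct the misspellings we know about."""
    trimmed = value.strip()
    return SPELLING_FIXES.get(trimmed.lower(), trimmed.title())


def run_batch(orders, customers):
    """Rule 2: join each order to its customer on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    result = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is None:
            continue  # this rule silently drops orders with no matching customer
        result.append({
            "order_id": order["order_id"],
            "name": customer["name"],
            "state": clean_state(customer["state"]),
        })
    return result


customers = [
    {"customer_id": 1, "name": "Acme", "state": " calfornia "},
    {"customer_id": 2, "name": "Globex", "state": "texsa"},
]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 2},
    {"order_id": 102, "customer_id": 3},
]

output = run_batch(orders, customers)
for row in output:
    print(row)
```

The first misspelling is corrected because a rule anticipates it. The second one ("texsa") passes through unchanged, and the unmatched order disappears, and neither problem is visible until someone inspects the batch output and adds another rule.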

Rules were defined based on previous experience with, or knowledge of, the data. More often than not, rules were iteratively created and refined once an organization started working with the data and saw the final results and the remaining problems in the data.

Self-service data prep, on the other hand, displays the real data visually and typically uses an Excel-like interface. The upside is that you can now see data in real time, spot misspellings and overlaps, and see common data elements that will allow you to combine or join data sets. Newer tools also add machine learning to help with these steps through smart suggestions. Self-service tools enable users to fix these types of data problems interactively, so they can immediately see the outcome. This results in a much easier and faster end-to-end process for creating and delivering data sets without relying on technical staff.

Sampling: It’s Just Easier Batch Data Prep

The biggest challenge when dealing with data visually and interactively is how to manage the performance on a large number of rows of data. Excel manuals suggest it can handle a million rows, but realistically most people will only work on much smaller sets of data if they want any kind of calculations like pivot tables. Obviously, many data sets in real life will be substantially beyond the one million row mark.


To overcome the performance challenge, most self-service data prep tools are architected to use some method of sampling and bring only a small sample of the data into the tool, usually some hard-coded number of rows. Users visually correct and blend the sample data, infer some data quality and data transformation logic, and then pass that inferred logic in the form of a script (the ‘rules’) to an administrator to run on the full data set.

The challenge is that a sample cannot be guaranteed to include all the data irregularities, so users must rely on an iterative approach to find and fix data issues, much like the traditional IT-centric ETL approaches. A sampling-based approach therefore changes only the way the rules of the data prep process are created (visually). The process still relies on an IT-supported batch run, which makes it just an easier batch data prep option.
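The sampling problem described above can be made concrete with a small sketch. It assumes a made-up data set of 100,000 rows containing three rare irregular values, and a hypothetical tool that loads only the first 1,000 rows; the row counts and values are illustrative, not measurements from any product.

```python
# Illustrative data only: 100,000 rows with three rare irregular values.
N_ROWS = 100_000
SAMPLE_SIZE = 1_000  # hypothetical hard-coded sample limit

# Row number -> irregular value; every other row holds a clean value.
IRREGULAR = {12_345: "Calfornia", 54_321: "N/A", 99_999: ""}


def state_value(row_number):
    """Return the state column for a row; almost every row is clean."""
    return IRREGULAR.get(row_number, "California")


def distinct_values(row_numbers):
    """Collect the distinct values a user would see in the given rows."""
    return {state_value(n) for n in row_numbers}


in_sample = distinct_values(range(SAMPLE_SIZE))  # first-N-rows sample
in_full = distinct_values(range(N_ROWS))         # the full data set
missed_by_sample = in_full - in_sample

# With a uniform random sample instead of the first N rows, this is the
# chance that any one specific row is left out of the sample.
p_miss_one_row = 1 - SAMPLE_SIZE / N_ROWS

print(sorted(missed_by_sample))
print(p_miss_one_row)
```

In this setup the user working on the sample sees only clean values, so no rule is written for any of the three irregularities. Switching to random sampling does not change the picture much, since each individual irregular row is still far more likely to be left out than included.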

Full Interactive Self-Service Data Prep

In contrast, full self-service data prep solutions allow users to work on the full data set. All machine learning-powered recommendations and visual cues are then based on all the data and hence reflect all irregularities. When a fix is needed and performed, it is immediately applied across the entire data set. More importantly, this can be done by the actual business users themselves, not IT. It also delivers maximum data prep productivity, as users get the right data and quality in one iteration.

Turning Data into Organizational Insights

Remaining competitive and truly capitalizing on your organization’s “data oil” means having the ability to derive value from the information at hand. But to quote Forrester again, “Doing that requires cutting overhead and inefficiencies from traditional data management timelines and, most important, driving from insights to actions that have business impact.”

Changing the familiar equation, where up to 80 percent of the time spent on analytical projects is expended on data prep, requires empowering business users with self-service methods that include full enterprise security and enable them to find and prep their data and then analyze that data in the tools of their choice. Failing to deliver these capabilities at enterprise data scale simply results in a new way to recreate old bottlenecks.

About the Author: Piet Loubser is SVP, Global Head of Marketing at Paxata, the pioneer and leader in enterprise-grade self-service data preparation for analytics. 
