May 21, 2018

The Seven Sins of Data Prep

Sean Kandel


Data preparation is often considered a necessary precursor to the “real” work found in visualizing or analyzing data, but this framing sells data prep short. The ways in which we cleanse and shape data for downstream use have significant bearing on our final analytic output, and cutting corners on data prep can run up a huge cost for companies.

According to a report from the Harvard Business Review, bad data costs the U.S. roughly $3 trillion per year, primarily due to the time involved in correcting data and the consequences of errors leaking through to customers. Below, we’ve outlined what we consider the seven sins of data prep: practices that are sure to change the end result for the worse.

Sin #1: Removing data

Removing incomplete, erroneous, outlying, or extraneous records is one of the most common transformations in data preparation. However, removing data can introduce bias or affect downstream results in meaningful ways. While such filtering is not inherently a sin, it is dangerous to remove data without understanding the implications for downstream results and without maintaining clear data lineage that communicates the rationale and logic behind the filters.

For example, Facebook made headlines in 2016 for overestimating the average viewing time of ads hosted on its platform. The reason? The reported average was calculated only from viewers who watched for three seconds or more; data for anyone who turned a video off sooner had been removed. While the decision was likely meant to rule out accidental viewers, this simple exclusion severely skewed the end result for consumers who were not aware that a meaningful subset of the data was missing.
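As a minimal sketch of filtering with lineage intact (using pandas; the column names and data are hypothetical, with the three-second threshold borrowed from the Facebook example), records can be tagged with an exclusion reason rather than deleted outright:

    import pandas as pd

    # Hypothetical viewing data; column names are illustrative.
    views = pd.DataFrame({
        "user_id": [1, 2, 3, 4],
        "seconds_watched": [0.5, 12.0, 2.1, 45.0],
    })

    # Tag excluded rows with the reason instead of deleting them, so the
    # filter logic stays visible and auditable downstream.
    views["exclude_reason"] = None
    views.loc[views["seconds_watched"] < 3, "exclude_reason"] = "watched < 3s"

    kept = views[views["exclude_reason"].isna()]
    print(f"kept {len(kept)} of {len(views)} rows")
    print("avg seconds watched (filtered):", kept["seconds_watched"].mean())
    print("avg seconds watched (all):", views["seconds_watched"].mean())

Reporting both averages side by side makes the effect of the filter explicit to anyone consuming the result.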

Sin #2: Inconsistent definitions

As more people in an organization prepare data for analytics, it is increasingly likely that they create inconsistent versions of the same metrics or reports. For instance, two analysts might have different definitions of an active user, geographic regions, or top-line metrics such as revenue. It’s important not just to report metrics across your organization but also to report how those metrics were derived, to prevent disparities in analysis or at least to understand why such disparities exist.
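One way to head off definition drift, sketched below in Python, is to have a single shared function own each metric so every analyst derives it the same way (the 30-day window and column names are assumptions for illustration):

    import pandas as pd

    def active_users(events: pd.DataFrame, as_of: pd.Timestamp,
                     window_days: int = 30) -> set:
        """The organization's one canonical definition of an active user:
        any user with at least one event in the trailing window."""
        cutoff = as_of - pd.Timedelta(days=window_days)
        return set(events.loc[events["event_time"] >= cutoff, "user_id"])

    events = pd.DataFrame({
        "user_id": [1, 1, 2, 3],
        "event_time": pd.to_datetime(
            ["2018-05-20", "2018-04-01", "2018-03-01", "2018-05-15"]),
    })
    print(active_users(events, pd.Timestamp("2018-05-21")))  # {1, 3}

Two reports that both import active_users cannot disagree about what “active” means, and the function’s docstring doubles as the metric’s documentation.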

Sin #3: Overwriting the source

While your data will take new shape during the preparation process, you should never destroy the raw source data. Instead, encode any processing in auditable, versionable, and reproducible logic.
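A minimal sketch of this discipline, assuming a simple file-based workflow with hypothetical paths: the raw file is only ever read, and each derived version is written to its own location by re-runnable code.

    import pandas as pd
    from pathlib import Path

    RAW = Path("data/raw/orders.csv")               # read-only source
    OUT = Path("data/derived/orders_clean_v1.csv")  # versioned output

    def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
        """The entire transformation lives in code that can be audited,
        version-controlled, and re-run against the untouched source."""
        df = df.copy()
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
        return df.dropna(subset=["order_date"])

    if __name__ == "__main__":
        raw = pd.read_csv(RAW)  # the source is never written back
        OUT.parent.mkdir(parents=True, exist_ok=True)
        clean_orders(raw).to_csv(OUT, index=False)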

Sin #4: Transformation without documentation

To ensure that you can reproduce, understand, and trust the results of your analytic pipelines, it is important to track all processing of data throughout the preparation process. Data lineage should be thoroughly tracked to explain how and when the data was transformed, as well as who transformed it, whether to meet external compliance standards or for your own internal assessment. Ideally, this information should be captured automatically while data is being prepared; where it isn’t, require documentation of the source data and all subsequent processing.
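Where a platform doesn’t capture lineage automatically, even a lightweight pattern helps. The sketch below (an illustrative pattern, not any particular product’s API) routes every transformation through a helper that records what ran, when, who ran it, and the row counts in and out:

    import datetime
    import getpass
    import pandas as pd

    lineage = []  # the audit trail that travels with the pipeline

    def apply_step(df, func, description):
        """Apply a transformation and record its lineage entry."""
        entry = {
            "step": description,
            "function": func.__name__,
            "user": getpass.getuser(),
            "at": datetime.datetime.now().isoformat(timespec="seconds"),
            "rows_in": len(df),
        }
        out = func(df)
        entry["rows_out"] = len(out)
        lineage.append(entry)
        return out

    def drop_null_ids(df):
        return df.dropna(subset=["id"])

    df = pd.DataFrame({"id": [1, None, 3]})
    df = apply_step(df, drop_null_ids, "remove records with missing id")
    print(pd.DataFrame(lineage))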

Sin #5: Not looking at your data


It’s important to understand which data quality issues have been addressed and which may have been introduced throughout the data transformation process. You should always have eyes on the data, profiling and assessing changes with each new transformation. With large or evolving datasets, however, it is often too difficult to manually assess every possible data quality issue. Leveraging a platform that automatically identifies potential issues throughout the transformation process lets you remedy them up front; the later you catch these issues, the more time-consuming and expensive they are to correct. With production preparation pipelines, automated methods for detecting, and potentially correcting, data quality issues are even more important.
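As a small illustration of keeping eyes on the data, a profile computed before and after each step makes quality shifts visible instead of silent (the checks here, null counts, distinct counts, and dtypes, are a minimal assumed set; real pipelines add more):

    import pandas as pd

    def profile(df: pd.DataFrame) -> pd.DataFrame:
        """A quick per-column summary to compare before/after a transform."""
        return pd.DataFrame({
            "nulls": df.isna().sum(),
            "distinct": df.nunique(),
            "dtype": df.dtypes.astype(str),
        })

    before = pd.DataFrame({"price": [9.99, None, 500000.0],
                           "sku": ["a", "b", "b"]})
    after = before.dropna()

    # Diffing the two profiles shows exactly what the step changed.
    print(profile(before))
    print(profile(after))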

Sin #6: Batch processing during exploration

Data preparation is an exploratory task. Preparation products that require batch processing over the entire dataset during exploration inhibit that exploration, or demand prohibitively expensive compute as data volumes grow. Additionally, as more analysts in an organization prepare data, server-side batch processing of entire datasets can lead to resource contention and concurrency issues. Where appropriate, leverage sampling techniques that provide immediate feedback with each interaction, and use client-side execution to amortize computation costs across hundreds or thousands of users.
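A sketch of the sampling approach, with synthetic data standing in for a large table: iterate interactively on a small, reproducible sample, then run the settled logic once over the full dataset. The 1% rate, seed, and filter are arbitrary choices for illustration.

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for a large dataset.
    rng = np.random.default_rng(0)
    full = pd.DataFrame({"duration": rng.normal(10, 5, 1_000_000)})

    # Explore on a fixed 1% sample so each interaction is immediate.
    sample = full.sample(frac=0.01, random_state=42)

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # ...the prep logic you iterate on interactively...
        return df[df["duration"] > 0]

    preview = transform(sample)  # instant feedback while exploring
    final = transform(full)      # one batch pass once the logic is settled
    print(len(preview), len(final))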

Sin #7: Dependence on a specific platform or engine

As technology evolves and requirements change, your data preparation work should not be tied to a specific data processing environment. It should be portable across deployment models and execution engines: prepare data with logic that can be compiled down into the appropriate processing framework now, and into the future.
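To make the idea concrete, here is a toy sketch of engine-independent logic: the transformation is declared once as data and then “compiled” to whichever back end is at hand. The two tiny targets below, pandas and generated SQL text, are illustrative stand-ins for real processing frameworks.

    import pandas as pd

    # Declare the prep logic once, independent of any engine.
    SPEC = [
        {"op": "filter", "column": "amount", "predicate": "> 0"},
        {"op": "select", "columns": ["id", "amount"]},
    ]

    def to_sql(spec, table):
        """Compile the spec to a SQL string for a database engine."""
        where = " AND ".join(f"{s['column']} {s['predicate']}"
                             for s in spec if s["op"] == "filter")
        cols = ", ".join(next(s["columns"] for s in spec
                              if s["op"] == "select"))
        return f"SELECT {cols} FROM {table} WHERE {where}"

    def to_pandas(spec, df):
        """Execute the same spec directly against a DataFrame."""
        for s in spec:
            if s["op"] == "filter":
                df = df.query(f"{s['column']} {s['predicate']}")
            elif s["op"] == "select":
                df = df[s["columns"]]
        return df

    print(to_sql(SPEC, "payments"))
    print(to_pandas(SPEC, pd.DataFrame({"id": [1, 2], "amount": [5.0, -1.0]})))

Because the spec, not the engine, is the source of truth, retargeting the pipeline means writing a new back end rather than rewriting the preparation logic.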

The downstream purpose for preparing data varies greatly, and so too will the ways in which data is cleansed and structured. But as you go through the process of preparing data, it’s important to avoid these seven sins for the best end result.

About the author: Sean Kandel is co-founder and chief technology officer of Trifacta, the global leader in data preparation. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
