The Seven Sins of Data Prep
Data preparation is often considered a necessary precursor to the “real” work found in visualizing or analyzing data, but this framing sells data prep short. The ways in which we cleanse and shape data for downstream use have significant bearing on our final analytic output, and cutting corners on data prep can run up a huge cost for companies.
According to a report from the Harvard Business Review, bad data costs the U.S. roughly $3 trillion per year. primarily due to the time involved in correcting data and the consequences of errors leaking through to customers. Below, we’ve outlined what we consider a “data prep sin” — or what will surely affect the end result for worse.
Sin #1: Removing data
Removing records containing incomplete, erroneous, outlying, or extraneous records is one of the most common transformations in data preparation. However, removing data can introduce bias or affect downstream results in meaningful ways. While such filtering is not inherently a sin, it is dangerous to remove data without understanding the implications on downstream results and without maintaining clear data lineage to communicate the the rationale and logic behind the filters.
For example, Facebook made headlines in 2016 for overestimating the average viewing time of ads hosted on its platform. The reason? The reported average was calculated based only on viewers who watched for three seconds or more because they removed data for anyone who immediately turned their videos off. While the decision was likely a means to rule out accidental viewers, this simple exclusion severely skewed the end result for consumers who were not aware that this meaningful subset of data had been removed.
Sin #2: Inconsistent definitions
As more people in an organization prepare data for analytics, it is increasingly likely that they create
inconsistent versions of the same metrics or reports. For instance, two analysts might have different definitions of an active user, geographic regions or top line metrics such as revenue. It’s important to not just report metrics across your organization but also to report how those metrics were derived to prevent disparities in analysis or to at least understand why such disparities exists.
Sin #3: Overwriting the source
While your data will take new shape during the preparation process, it’s important to never destroy the raw source data. Instead, you should encode any processing in auditable, versionable and reproducible logic.
Sin #4: Transformation without documentation
To ensure that you can reproduce, understand and trust the results of your analytic pipelines it is important to track all processing of data throughout the preparation process. Data lineage, should be thoroughly tracked to explain how and when the data was transformed, as well as who transformed it, whether to meet external compliance standards or for your own internal assessment. Ideally, the capture of this information should happen automatically while data is being prepared, but if not, it’s important to require documentation about the source data and all subsequent processing.
Sin #5: Not looking at your data
It’s important to understand what data quality issues have been addressed and which ones may have been introduced throughout the data transformation process. You should always have eyes on the data by being able to profile and assess changes to the data with each new transformation. However, with large or evolving datasets, it’s often too difficult to manually assess all possible data quality issues. Leveraging a platform that will automatically identify potential issues throughout the transformation process is key so you can remedy these issues up front—the later you catch these, the more time consuming and expensive they are to correct. Additionally, with production preparation pipelines, it is even more important to have automated methods for detecting and potentially correcting data quality issues.
Sin #6: Batch processing during exploration
Data preparation is an exploratory task. Preparation products that require batch processing over the entire dataset during exploration inhibit exploration or require prohibitively expensive computation power as data volumes grow larger. Additionally, as more analysts in an organization prepare data, server-side batch processing of the entire data can lead to resource contention and concurrency issues. Where appropriate, it’s important to leverage sampling techniques that allow for immediacy with each interaction and client-side execution can be used to amortize computation costs across hundreds or thousands of users.
Sin #7: Dependence on a specific platform or engine
As technology evolves or requirements change, it is important that data preparation work should also not be tied to a specific data processing environment. Your work should be portable across deployment models and execution engines. It’s important to prepare data with logic that can be compiled down into the appropriate processing framework now, and into the future.
The downstream purpose for preparing data varies greatly, and so too will the ways in which data is cleansed and structured. But as you go through the process of preparing data, it’s important to avoid these seven sins for the best end result.
About the author: Sean Kandel is co-founder and chief technology officer of Trifacta, the global leader in data preparation. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.