Three Deadly Sins of Data Science
Over the last few years, data science (DS) has moved from small-scale R&D efforts in companies to production applications, strategic decision-making, and other avenues of business value creation. While it’s still early in the story of understanding how DS should operate inside a business, there are some clear “anti-patterns” or “sins.”
These problems have plagued organizations from tech titans to small startups. While non-technical leaders who work with DS often worry about learning the basics of technical material (you’ll find “TensorFlow for Poets” open on more than a few of their browsers), the problems are most often independent of whether you’re using the latest self-attention GAN neural network or good-old-fashioned linear regression.
Building On Top of Poor Data Quality
Long before the heyday of machine learning (ML), computer scientists used the expression “Garbage In, Garbage Out” to note that algorithms cannot fix essential defects in their inputs. In the context of ML, this expression typically refers to problems with the data used to train and/or evaluate ML systems.
This “sin” typically manifests as labeled data for DS systems that is either low-quality or buggy in some essential way (e.g., every link clicked in an application has a null domain). These issues prevent an ML algorithm from learning a pattern that accurately reflects the underlying problem. Based on anecdotal experience, organizations under-invest in the systems and processes that ensure the data they’re feeding DS is of high quality. The irony is that this under-investment undercuts the value derived from their investment in DS.
The central challenge with data quality (explained here) is determining how to characterize it or define metrics around it. Nonetheless, there are a few ways to avoid committing this “sin,” listed here in order of investment level:
- Manual spot-checking: Perform manual spot-checking of data, particularly on examples your ML model may be getting incorrect or that seem impossible (e.g., a website visit that lasts hours in your logs). Even with a large engineering team, it is common to have user-activity logs that are visibly wrong after looking at just a few examples. Fixing these issues can yield more impact, and much more quickly, than further DS cycles.
- Measure inter-annotator agreement (Kappa): Another common problem is that a team may have an ample amount of seemingly good labeled data (collected manually or from crowd-sourcing), but two people in the company can’t blindly agree on how to label examples. Generally speaking, if you can’t get two labelers to blindly agree most of the time, it’s going to be challenging to expect good quality from a DS system. The problem may be that there isn’t enough context or information in the data, or that the labeling guidelines could be improved. One more possible explanation is that the problem itself may not be sufficiently well-defined.
- Invest in data quality (DQ) tests: For any large data warehouse, be sure to build DQ testing so that you’re aware of any changes to your data (e.g., new records with a NULL column) and any violations of integrity constraints (e.g., has the mean visit time changed in the last week?). Once the problem manifests in trained DS models, it’s usually much harder to trace back to a data quality problem.
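The second and third items above can be made concrete with very little code. What follows is a minimal sketch, not a production tool: it computes Cohen’s kappa (one common inter-annotator agreement measure, chosen here as an assumption about what “Kappa” refers to) and two simple DQ checks. The function names, the `domain` column, and the sample values are all hypothetical and exist only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label
    # if each labels independently at their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def null_rate(rows, column):
    """Fraction of rows (dicts) whose `column` is missing or None."""
    return sum(row.get(column) is None for row in rows) / len(rows)


def mean_shift(baseline, current):
    """Relative change in the mean between a baseline and a current sample."""
    base_mean = sum(baseline) / len(baseline)
    return (sum(current) / len(current) - base_mean) / base_mean


# Hypothetical labels from two annotators on the same four examples.
annotator_a = ["spam", "spam", "ham", "ham"]
annotator_b = ["spam", "ham", "ham", "ham"]
kappa = cohen_kappa(annotator_a, annotator_b)

# Hypothetical click log rows and visit times (seconds) for the DQ checks.
clicks = [{"domain": None}, {"domain": "example.com"}]
domain_null_rate = null_rate(clicks, "domain")
visit_time_shift = mean_shift([10.0, 10.0], [12.0, 12.0])
```

A kappa near 1 means the labelers agree far more than chance would predict, while a value near 0 means their agreement is about what chance alone would produce. The DQ checks are only useful if they run on a schedule and alert when a threshold is crossed; the thresholds themselves are a judgment call for each dataset.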
Not Having a Well-Defined Measure of Success
Data scientists should always be aware that having a good evaluation for the problem they’re working on matters more than the techniques they want to apply.
The gist of this sin is not investing in established metrics that allow you to objectively measure improvements in DS systems. Building out the infrastructure around those metrics to enable fair comparisons over time should be prioritized over any models or changes to existing systems. As an example, imagine you are working on a system to rank Groupon-style “deal recommendations” on the front page of a website. One needs to evaluate not only how often a user clicks on the deal, but how often they dwell on the landing page, or respond to the call-to-action on the deal.
As with any organizational challenge, in the absence of good metrics to track progress, it’s difficult to drive DS improvements. ML metrics aren’t typically identical to business metrics (like customers acting on a deal recommendation), but should be broadly correlated with those outcomes. If you’re a non-technical person working with a DS team, you may not be able to understand how the system or model works, but you should insist on understanding how the system is evaluated.
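To illustrate what “understanding how the system is evaluated” can look like for the deal recommendation example, here is a minimal sketch of a single evaluation function that reports the three signals mentioned above. The `Impression` record, its field names, and the sample data are assumptions made for illustration; a real system would read these from its own logs.

```python
from dataclasses import dataclass


@dataclass
class Impression:
    """One deal shown to one user (hypothetical log record)."""
    clicked: bool
    dwell_seconds: float  # time on the landing page; 0.0 if not clicked
    acted: bool           # user responded to the deal's call-to-action


def evaluate(impressions):
    """Return click rate, mean dwell time per click, and action rate."""
    if not impressions:
        raise ValueError("need at least one impression")
    n = len(impressions)
    clicks = [i for i in impressions if i.clicked]
    mean_dwell = (
        sum(i.dwell_seconds for i in clicks) / len(clicks) if clicks else 0.0
    )
    return {
        "click_rate": len(clicks) / n,
        "mean_dwell_seconds": mean_dwell,
        "action_rate": sum(i.acted for i in impressions) / n,
    }


# Made-up impressions served by one model variant.
sample = [
    Impression(clicked=True, dwell_seconds=30.0, acted=True),
    Impression(clicked=True, dwell_seconds=10.0, acted=False),
    Impression(clicked=False, dwell_seconds=0.0, acted=False),
    Impression(clicked=False, dwell_seconds=0.0, acted=False),
]
report = evaluate(sample)
```

The point of keeping this in one function is that every model variant is scored the same way, on the same definitions, so comparisons over time stay fair.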
Not Understanding How Data Science Output Is Related To Business Outcomes
This third sin is the most common among organizations that are otherwise quite strong at DS. The DS team may have an internal metric that can be improved, but what if it turns out you aren’t seeing downstream business value from improvements on that metric? This situation is relatively common and can occur when the DS team doesn’t have a rich understanding of the relationship between DS metrics and business metrics.
For instance, in the running “deal recommendation” example above, suppose the DS team has found it can reliably improve the number of clicks on offer recommendations through ML model iteration. However, it might turn out that the click increases aren’t leading to more offer conversions or better user retention. For the business, the DS click metric is a proxy for recommendation quality and for the business value that improvements add, but there isn’t always a direct correlation between the proxy metric and downstream business value. For instance, your ML system may prefer clickbait-style images, which might increase clicks at the cost of a higher bounce rate and, ultimately, lower user retention.
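One modest way to check whether a proxy metric tracks business value is to look back at past experiments and correlate the change in the proxy with the change in the business metric. The sketch below computes a Pearson correlation by hand; the lift numbers are invented for illustration, and correlation across a handful of experiments is a rough signal at best, not proof of a causal link.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical results from five past experiments (percent change vs. control).
click_lift = [1.0, 2.5, 4.0, 0.5, 3.0]
retention_lift = [0.2, 0.1, -0.6, 0.1, -0.3]

r = pearson(click_lift, retention_lift)
```

In this made-up data the experiments with the largest click gains show the worst retention changes, so `r` comes out negative, which is the clickbait pattern described above. A team that sees this should question whether clicks are the right thing to optimize.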
About the author: Aria Haghighi is the vice president of data science at Amperity, where he is responsible for leading the company’s data science team to expand core capabilities in identity resolution. He has more than 15 years of technology experience playing key advisory and leadership roles in both startup and enterprise companies. Most recently, Haghighi was Engineering Manager at Facebook where he was responsible for leading the Newsfeed Misinformation team, which uses machine learning and natural language processing to improve the integrity of content on the platform and tackle the prevalence of fake news, hoaxes, and misinformation. Haghighi has also held leadership and technical roles at some of the world’s biggest tech companies including Apple, Microsoft and Google.