Achieving Data Literacy: Businesses Must First Learn New ABCs
Do you speak data? That’s the essential question Gartner poses to data and analytics leaders in promoting data literacy, “the ability to read, write and communicate data in context, including an understanding of data sources and constructs, analytical methods and techniques applied — and the ability to describe the use case, application and resulting value.”
“Do you speak data?” is a good question, but it’s not the conversation starter I’d use in discussing data literacy with businesspeople: the folks on the front lines, with P&L responsibility, under pressure to digitally transform their companies, somewhat magically, with data. We need to make sure that the people charged with digital transformation can communicate about data, and in the right language. I’d first ask, “Do you know the alphabet? Let’s go through the ABCs of data.”
A Is for Awareness
Data science and business leaders alike know “garbage in, garbage out,” eruditely defined by the Oxford Reference as a phrase “used to express the idea that in computing and other spheres, incorrect or poor quality input will always produce faulty output.”
Perplexingly, I see business leaders sometimes focused exclusively on the analytic model or artificial intelligence (AI) algorithm they believe will produce the insight they seek, without focusing on the data the algorithm will be fed. Is the algorithm appropriate for the data? Will it meet Ethical AI standards? Is there enough data, and are there enough high-quality data exemplars? No matter how innovative the model or algorithm, it will only produce results that are as accurate and unbiased as the data it consumes.
A modern data science project, therefore, is a lot like an old-fashioned computer programming project: 80% of the time should be spent gathering the proper data and making sure it is correct, admissible and unbiased.
While the 80% yardstick itself isn’t new, data usage and data standards are changing, and they are complicated. Companies should formalize their model governance standards and enforce them before admitting data for a project, because customer data is no longer free from usage constraints. Companies must comply with regulations concerning customer consent and permissible use; increasingly, customers have the right to be forgotten, or to have their data withdrawn from future models.
In short, customer data can be riddled with quality issues and biased outcomes, and can’t be used in the freewheeling ways and academic pursuits of decades past. Business leaders must be aware of these important facts, and cognizant of their company’s very strong governance around data and AI. If governance isn’t established, it needs to be.
B Is for Bias
Biased data produces biased decisions—perhaps best paraphrased as “producing the same old garbage.” Organizations and data scientists must recognize that if they build a model to exactly replicate bias, even inadvertently, their work product will continue to propagate bias in an automated and callous fashion.
There are helpful guidelines to help compliance officers, for example, avoid biased and other unethical uses of AI. Because bias is rooted in data, the best default is to treat all data as dirty, suspect, and a liability hiding multiple landmines of bias. The job of the data scientist and the organization is to prove that their use of specific data fields, and the way the algorithms leverage those fields, is acceptable.
It’s not an effortless task. Aside from obvious data inputs, such as race or age, other seemingly harmless fields can impute bias during model training, introducing confounding (unintended) variables that automate biased results. For example, cell phone brand and model can impute income and, in turn, bias to other decisions, such as how much money a customer may borrow, at what rate.
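One way to make the proxy concern concrete is to measure how much a seemingly harmless field reveals about a sensitive one. The sketch below is illustrative only: the toy records, the field names `phone_tier` and `income_band`, and the helper `proxy_strength` are all hypothetical, and the metric (how often you would guess the sensitive value correctly from the harmless field alone, compared with a majority-class baseline) is a simple stand-in, not a method prescribed here.

```python
from collections import Counter, defaultdict

def proxy_strength(records, field, sensitive):
    """Return (baseline, accuracy) for guessing `sensitive` from `field` alone.

    baseline: share of records in the most common sensitive class overall.
    accuracy: share guessed correctly when, for each value of `field`,
              we predict the most common sensitive class within that value.
    """
    overall = Counter(r[sensitive] for r in records)
    baseline = overall.most_common(1)[0][1] / len(records)

    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[field]][r[sensitive]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / len(records)

# Hypothetical toy data: phone tier lines up with income band most of the time.
records = (
    [{"phone_tier": "premium", "income_band": "high"}] * 40
    + [{"phone_tier": "premium", "income_band": "low"}] * 10
    + [{"phone_tier": "budget", "income_band": "high"}] * 10
    + [{"phone_tier": "budget", "income_band": "low"}] * 40
)

baseline, accuracy = proxy_strength(records, "phone_tier", "income_band")
print(f"baseline={baseline:.2f} accuracy={accuracy:.2f}")
```

When the accuracy sits well above the baseline, as it does in this toy data, the field is carrying information about the sensitive attribute and deserves the same scrutiny as the attribute itself.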
Furthermore, latent (unknown) relationships between otherwise acceptable data fields can also unintentionally impute bias. These dirty patterns hidden in data are not in full view, and machine learning models can find them in ways that human scientists will not anticipate. This is why it is so important to examine the relationships a machine learning model has actually learned, and not rely on the stated importance of data inputs to a model.
Finally, data that may not introduce bias today might in the future—what is the company’s continual data bias monitoring policy? Today many organizations don’t have any plan.
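A continual monitoring policy can start small, for instance as a recurring check of outcome rates by customer group. The following is a minimal sketch under stated assumptions: the groups "A" and "B", the monthly decision logs, and the helper `approval_rate_ratio` are hypothetical, and the 0.8 alert threshold is an assumption borrowed from the commonly cited "four-fifths" rule of thumb, not a standard set out in this article.

```python
def approval_rate_ratio(decisions):
    """decisions: iterable of (group, approved) pairs.

    Returns the lowest group approval rate divided by the highest.
    """
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    rates = {g: approved[g] / totals[g] for g in totals}
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved in any group; no disparity to measure
    return min(rates.values()) / highest

ALERT_THRESHOLD = 0.8  # assumption: tune to your own governance policy

# Hypothetical monthly decision logs for two customer groups.
monthly = {
    "2020-01": [("A", True)] * 80 + [("A", False)] * 20
             + [("B", True)] * 75 + [("B", False)] * 25,
    "2020-06": [("A", True)] * 80 + [("A", False)] * 20
             + [("B", True)] * 50 + [("B", False)] * 50,
}

alerts = {}
for month, decisions in monthly.items():
    ratio = approval_rate_ratio(decisions)
    alerts[month] = ratio < ALERT_THRESHOLD
    print(month, round(ratio, 2), "ALERT" if alerts[month] else "ok")
```

A check like this does not explain why the gap opened up; it only tells the governance team when to go and look.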
Clearly, there are many issues around data for data scientists and business leaders alike to consider and understand. Policies around data usage and monitoring are pillars of a strong AI governance framework, a template for ethical use of analytics and AI by the company as a whole. These policies include establishing methods to determine if data is biased because the collected sample is inaccurate, or the wrong data is being sourced, or simply (and sadly) because we live in a biased world. Equally important, how does the governance framework provide for identifying and remedying bias?
C Is for Callousness
Bottom-line business leaders want the decision an analytic model will make, and they want to automate it with AI. In the rush to seize the business insight from an analytic model and automate it, companies often are not building models robustly. They are neither scenario testing nor bias testing. These mistakes are to the detriment of the customers whom companies are trying to serve, because once the data and analytics are complete, business leaders are presented with a score that will operationalize decision-making. Score-based decisioning enables automation, but also facilitates automated bias at scale. Business leaders must be sensitive to the potential callousness of decisioning based on an abstracted score.
For example, COVID has unleashed some level of economic despair on every corner of the planet. Data has shifted, exposing the fact that many businesses don’t understand the impact that changes in customer data, performance data and economic conditions have on their model scores, or how to use those scores in automated decisioning. Callous business leaders are those who stubbornly continue to apply model scores because “the model told me,” rather than looking at how data and situations have changed for groups of customers, and adjusting their use of models in business strategy.
We also must ensure those decisions are properly recorded. For example, a customer may have purchased a new phone from the wireless service provider just prior to COVID. If that customer stops paying, how is that outcome recorded: as fraud or as credit risk default? Are certain groups of customers more susceptible to job loss during COVID due to their profession? Do we find that socioeconomic, ethnic or geographic bias is driving credit default or fraud rates due to sloppiness in labeling outcomes, plain and simple?
When bias, carelessness or abject callousness is employed in dispositioning cases, it results in even more bias as future generations of models are developed. I routinely see this chain of events in situations where credit risk default gets labeled as fraud. Certain groups of customers credit-default more than others due to profession or education; when they are mislabeled due to careless, callous, or biased outcome assignments, entire groups of customers are pigeonholed as more likely to have committed fraud. Tragically, organizations are self-propagating bias in future models through this callous assignment of outcome data.
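One common way to notice that data has shifted under a model is to compare the score distribution today against the distribution at development time. The sketch below uses the Population Stability Index (PSI), a widely used drift measure that is swapped in here for illustration and is not named in this article. The score samples, bin edges and helper `psi` are hypothetical, and the often-quoted reading of PSI (below 0.1 stable, above 0.25 a significant shift) is a rule of thumb, not a standard.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two score samples.

    `edges` are shared bin boundaries; this sketch assumes every score
    falls within edges[0]..edges[-1]. Empty bins are floored at `eps`
    so the logarithm stays defined.
    """
    def shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical scores: evenly spread at development time, skewed later.
edges = [0, 250, 500, 750, 1000]
baseline = [100] * 25 + [300] * 25 + [600] * 25 + [900] * 25
current = [100] * 10 + [300] * 20 + [600] * 30 + [900] * 40

print(f"PSI vs. itself:  {psi(baseline, baseline, edges):.3f}")
print(f"PSI vs. current: {psi(baseline, current, edges):.3f}")
```

A rising PSI is a prompt to revisit how the score is used in strategy for the affected groups of customers, not a verdict on the model by itself.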
In short, a model is a tool, to be wrapped in a comprehensive decisioning strategy that incorporates model scores and customer data. “When should we use the model?” and “When should we not?” must be questions understood by business leaders as data shifts. Equally important is the question, “How do we not propagate bias through callous outcome assignments and treatments?” The answers to these questions build a foundation for stopping the cycle of bias.
All Together Now
While the decisions rendered by analytic models are often a binary “yes” or “no,” “good” or “bad,” the issues around the proper use of data are anything but—they are complex, nuanced and cannot be rushed. As companies increasingly recognize that data literacy is the gateway to digital transformation, I am hoping that, over time, data scientists and business leaders can be on “the same (data governance) page” of a metaphorical corporate songbook: “Now I know my data ABCs, next time won’t you sing with me?”
About the author: Scott Zoldi is Chief Analytics Officer at FICO responsible for the analytic development of FICO’s product and technology solutions, including the FICO Falcon Fraud Manager product which protects about two thirds of the world’s payment card transactions from fraud. While at FICO, Scott has been responsible for authoring more than 100 patents with 65 patents granted and 45 pending. Scott is actively involved in the development of new analytic products utilizing Artificial Intelligence and Machine Learning technologies, many of which leverage new streaming artificial intelligence innovations such as adaptive analytics, collaborative profiling, deep learning, and self-learning models. Scott is most recently focused on the applications of streaming self-learning analytics for real-time detection of Cyber Security attack and Money Laundering. Scott serves on two boards of directors including Tech San Diego and Cyber Center of Excellence. Scott received his Ph.D. in theoretical physics from Duke University. Keep up with Scott’s latest thoughts on the alphabet of data literacy by following him on Twitter @ScottZoldi and on LinkedIn.