Why You Need to Trust Your Data
There’s something that often gets lost in discussions about artificial intelligence and advanced analytics: the importance of the data. Having good, clean data is absolutely essential, but all too often, companies lack trust in that critical resource, which can lead business leaders to make bad decisions or resort to following gut instincts.
No matter how good your analytics are, you’re not getting anywhere if you can’t trust your data. The goal for many companies when constructing predictive analytics is to get the data as clean as possible, a process that studies show can consume up to 80% of a data scientist’s time. But even if the data is 100% true, your model may give the wrong predictions if the data doesn’t accurately reflect the thing you’re trying to predict.
The importance of training data is easily overlooked, says Alyssa Rochwerger, the vice president of product at Figure Eight (formerly CrowdFlower), a San Francisco, California-based firm that develops a platform that companies use to annotate and label the big data sets used to train machine learning models.
Companies that don’t have a plan for how they’re going to produce training data at scale don’t have much of a shot at succeeding at AI, says Rochwerger, who cited a recent Gartner study that found 90% of artificial intelligence projects fail due to a lack of good or appropriate training data.
“There’s so much academic attention on the algorithms…and deep learning models figuring out the feature themselves,” she says. “But what’s often overlooked in the news is it all comes back to training data and the quality of your training data. If there’s bias in the training data or there’s missing data in the training data, or it’s not quite aligned with the business problem you’re solving, the project will fail.”
Not all bias is bad. In fact, the word “bias” also refers to the statistical signal that data scientists look for, one that indicates a value standing out from the pattern or from background noise. But when constructing predictive models that are to be used in production, data scientists will usually seek to eliminate unwanted bias – and in fact, the law requires them to do so in many circumstances, such as when models are used to decide loan or insurance applications.
“There’s always bias in the data. That’s the whole point,” she tells Datanami. “One of the challenges is having unintended bias that may impact a particular group that’s not intended – gender, race, geography, or other protected classes. You fix it by having data sets that are representative, or an annotation group that’s represented. It’s something that’s very easily solved, but it requires asking a lot of upfront questions and thinking about it thoroughly ahead of time.”
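One way to ask those upfront questions is to compare how often each group appears in the training set against the share you would expect in the population the model will serve. The snippet below is a minimal sketch of that idea, not a method described by Figure Eight. The function name `representation_gaps`, the 5% tolerance, and the regional figures are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def representation_gaps(samples, reference_shares, tolerance=0.05):
    """Return groups whose share of `samples` differs from the expected
    share in `reference_shares` by more than `tolerance`.

    The value for each flagged group is (observed share - expected share),
    so a negative number means the group is under-represented.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)
    return gaps


# Hypothetical example: the region attached to each training record,
# compared with an assumed share of customers in each region.
training_regions = ["north"] * 70 + ["south"] * 25 + ["west"] * 5
reference = {"north": 0.40, "south": 0.35, "west": 0.25}

gaps = representation_gaps(training_regions, reference)
print(gaps)
```

In this made-up case every region is flagged: "north" is over-represented, while "south" and "west" fall short of their expected share. A check like this only covers attributes that are actually recorded, so it complements rather than replaces the thinking ahead of time that Rochwerger describes.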
Real World Challenges
The labeling stage of the data science process is critical to AI success, so businesses need to take extra pains to ensure the labeling of their training data is done in the proper manner. But getting it “right” for one group often means getting it “wrong” for another, which complicates matters, according to Mihir Naware, principal product manager at Adobe Experience Cloud.
“Let’s take the case of sentiment,” Naware says. “To one person, it might be a neutral sentiment, but to another it might be heightened, either positively or negatively. That’s another source [of bias], so it’s very important to monitor the quality of the labeling, very much keep in touch with what’s going on, and make sure there is continuous sampling to ensure that no biases creep in at that step.”
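The continuous sampling Naware describes could look something like the sketch below: pull a random subset of labeled items, have a second annotator label them, and measure how far the two agree beyond chance. Cohen's kappa is used here as one common agreement statistic. It is an assumption for illustration, not something the article attributes to Adobe. The helper names and the sentiment labels are hypothetical.

```python
import random
from collections import Counter


def sample_for_review(item_ids, k, seed=None):
    """Pick up to k item ids at random for a second round of labeling."""
    rng = random.Random(seed)
    items = list(item_ids)
    return rng.sample(items, min(k, len(items)))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels from two annotators on the same eight items.
annotator_1 = ["pos", "neu", "neg", "neu", "pos", "neu", "neg", "neu"]
annotator_2 = ["pos", "pos", "neg", "neu", "pos", "neg", "neg", "neu"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Tracking a figure like this on each new sample gives an early signal when annotators start to diverge, for instance when one group reads a comment as neutral and another reads it as strongly negative.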
To keep the models generating good recommendations, it’s often advisable to segregate models by geography or by time. In Adobe’s customer journey offerings, the experiences that customers have over longer periods of time become fodder for the algorithms.
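As a rough illustration of segregating models by geography, the sketch below groups records by a segment field and fits a separate model for each group. The "model" here is deliberately trivial, just the mean of the target within each segment, and the field names `region` and `order_value` are hypothetical. This is not a description of how Adobe's products work.

```python
from collections import defaultdict
from statistics import mean


def fit_segment_baselines(records, segment_key, target_key):
    """Fit one trivial model (the mean of the target) per segment.

    `records` is a list of dicts; `segment_key` could just as easily be a
    time bucket, such as a quarter, instead of a region.
    """
    grouped = defaultdict(list)
    for row in records:
        grouped[row[segment_key]].append(row[target_key])
    return {segment: mean(values) for segment, values in grouped.items()}


# Hypothetical customer records.
records = [
    {"region": "EMEA", "order_value": 40.0},
    {"region": "EMEA", "order_value": 60.0},
    {"region": "APAC", "order_value": 10.0},
    {"region": "APAC", "order_value": 30.0},
]

baselines = fit_segment_baselines(records, "region", "order_value")
print(baselines)
```

A single model fit on all four records would predict 35.0 for everyone, which fits neither region well. Splitting by segment trades some training data per model for predictions that reflect local behavior.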
“The more and more data you get which represents the real world journey of a customer — of a brand with the customer — the better and better you get at predicting certain things or understanding drivers or changes in actions that markets might be looking at,” Naware says. “It all comes down to really understanding what the data is really capturing and not assuming that all the data is as good as the real world out there. It’s actually just a snapshot.”
Fit for Purpose
Not all data is equal. Some data needs to be governed closely and transformed into pristine values, while other data can have big error bars and still be useful, says KPMG Principal of Data and Analytics Traci Gusher. That is to say, the data needs to be “fit for purpose.”
“If you’re talking about the data that’s going to be utilized to do your financial statement, your operational systems, or very important key management decisions, the data behind those analyses, those statistics, those reports, needs to be pristine,” she says. “They need to be tightly controlled, they need to be tightly governed, and there needs to be repeatable process, standardization, and stewardship around how that data is handled.”
But that high bar doesn’t have to be met for other types of use cases, such as using social media data for customer sentiment analysis, or some types of historical analysis. “That data doesn’t have to be so pristine,” she tells Datanami.
Despite the advances that have been made in machine learning and advanced analytics in recent years, many companies aren’t using them to make operational decisions simply because business leaders don’t trust them. Much of that distrust stems from suspicions about the data, Gusher says. “There is still very much a lack of trust in using it to make decisions,” she says.
Over the years, many companies have invested the time and resources to build the governance and develop the processes that allow executives to trust the results generated by traditional business intelligence applications, such as reporting, metrics, KPIs and dashboards. However, that level of governance and those types of processes have not yet been duplicated on the big data sets, which is one source of the disconnect.
“There are still a lot of organizations that still don’t even trust their BI. But when you start talking about the advanced analytics, there’s not as much trust in the data,” she says. “A lot of leaders don’t understand how the advanced analytics that are being presented to them work, what value is in it, and how it’s different than maybe basic analyses they’ve received in the past.”
Without trusted data to guide them, executives resort to making decisions the old way: following their gut instinct.
“Where trust in data is gaining momentum, so is the ability to trust data and analysis to make decisions,” Gusher says. “But the two go hand in hand. Where there’s still a lack of trust in the data, there is still very much a lack of trust in using it to make decisions, and those are the places where a lot of leaders are still using their gut that is informed by the data, versus letting the data lead in the decision.”
It’s not always wrong to follow gut instinct. Algorithms are often lousy at informing us what to do when cultural differences arise, which is one area where human experience should win out, Gusher says. But business leaders are more apt to succeed if they can be well-informed by accurate data.
There are several ways that companies are addressing that big data disconnect. For starters, business leaders are “organically educating” themselves on advanced analytic methods, including machine learning. It’s not uncommon to see financial executives taking Coursera courses, Gusher says. At the same time, outsiders with experience in advanced analytics are joining organizations and bringing their knowledge with them. That’s helping to create a common language for talking about big data and analytics, which fosters trust.
The ability of a business leader to query advanced analytic techniques – that is, to ensure that they are, indeed, giving good data-driven recommendations and aren’t just regurgitating random noise – is a critical step in the process, and one that has yet to occur in many organizations, Gusher says.
Learning to Trust Your Data
One Fortune 100 firm that Gusher worked with had this exact problem. “Their most senior leaders didn’t necessarily trust in a lot of the analyses they were seeing because, in some cases, it directly conflicted with what their gut was telling them and they didn’t feel empowered to ask the right questions in order to figure out if it was analysis they should trust or analysis they should refute,” she says.
The company, which was an industrial manufacturer, also had a lot of duplication of effort going on in advanced analytics. “They weren’t speaking on the same types of terms between teams, so it didn’t sound like it was the same thing, but when you actually looked at what they did, it was identical,” she says.
To solve these related problems, the company instituted a top down education program where the executive board went through two full days of live training on advanced analytics, machine learning, and artificial intelligence, Gusher says.
“They agreed upon the right level of content to take to their leaders one level below them, and taught their leaders how to teach their courses to continue educating on down, all the way to the people on the ground.”
No matter what stage of the analytic game you’re at – whether you’re still building out dashboards or implementing advanced AI – it all comes back to trusting your data. If you can’t trust your data, you’re not going to get very far. It’s as simple as that.