April 12, 2021

Why AI Is Failing for Enterprises: Predetermination Bias

Jed Yueh


Our best practices for managing data are ancient. Literally.

For tens of thousands of years we’ve managed the future by predetermining the resources we think we’ll need, limiting our futures to what we can foresee. We’re now in the age of AI, with a radically more open future. AI, especially machine learning, can learn from seemingly insignificant data, often to our human delight and surprise.

So, it’s time to rethink our processes and mindset when it comes to data. We need to stop limiting data to the futures we expect. That’s not easy, because predetermination is a deep habit that guides much of our lives. For instance, to save space, time, and money, we fill our kitchen pantries with only the ingredients we think we’ll need.

But that predetermination inhibits innovation. We’re unlikely to try, say, a new Indonesian recipe if we’re missing kaffir lime leaves.

Worse, our imagination itself is muted by our awareness of the supplies on hand. This is an even more pernicious type of predetermination because we often don’t recognize that it’s happening.

Call it predetermination bias.

Culling Bias

Predetermination bias is often hardcoded into the ETL (extract, transform, load) pipelines enterprises use to manage data today. Typically only a small stream of application data, about 5% to 10% of the total, makes it through the pipeline and lands in a data warehouse for analysis.


Data lakes and NoSQL data stores improve upon the process by switching ETL to ELT. That’s a good start, but it’s not enough. Extracting data leaves most of the data behind. Loading results in yet another copy of data that keeps getting bigger and bigger. And transforming data still strips off what is unique in order to standardize it.
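The culling a conventional pipeline performs can be made concrete with a toy sketch (the schema and field names below are invented for illustration): the transform step keeps only the columns a predetermined warehouse schema anticipated, so everything else never lands.

```python
# Toy illustration of predetermination in a transform step: only
# fields named in a predetermined schema survive the pipeline.

def transform(records, schema):
    """Keep only the fields the warehouse schema anticipated."""
    return [{k: r[k] for k in schema if k in r} for r in records]

source = [
    {"order_id": 1, "amount": 42.0, "device": "mobile", "referrer": "ad-7"},
    {"order_id": 2, "amount": 19.5, "device": "web"},
]
warehouse_schema = ["order_id", "amount"]  # hypothetical predetermined schema

loaded = transform(source, warehouse_schema)
# Fields like "device" and "referrer" -- potential ML signal -- are culled
# before anything downstream can learn from them.
```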

Data warehouses, data lakes, and NoSQL stores are useful, just like kitchen pantries. But they reduce the total available information.

In the Age of Machine Learning, we’re learning the importance of what might seem like insignificant data. Machine learning neural networks compute data weights from webs of correlations often too complex for our minds to unravel. The amount and complexity of the data they process are why machines can now beat the world’s human champions at chess and can diagnose diseases from data in ways that doctors and computer scientists cannot understand.

Then there’s the lesson from the tech giants. When your applications serve billions of users, you have to rely on algorithms to make immediate judgments based on as much data as possible.

Now it’s time to end data predetermination for the rest of the world’s businesses. We need a new model for thinking about and acting on data.

Leave No Data Behind

If predetermination, ETL, and ELT are passé, then what does a modern data process look like?

Here are seven guidelines for modern data management:

Access to all the primary data: Most of the valuable data in an enterprise sits in enterprise applications, from mainframe to cloud native. Instead of settling for a thin stream of ETL or ELT data, we need access to all of the primary data across multi-generational platforms. And we need an efficient way to access it that does not impact or add risk to production applications.


API-driven access: Scurrying from department to department to learn the magical phrases that unlock data access takes too long. In today’s world, we need APIs (application programming interfaces) that provide simple, uniform ways to request data from all sources.
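As a sketch of what uniform, API-driven access might look like (the endpoint layout and parameter names here are hypothetical, not any particular vendor’s API): every source is addressed through the same request shape, so consumers never need per-department rituals.

```python
import json
from urllib import request

# Hypothetical uniform data API: one request shape for every source,
# whether it's a mainframe extract or a cloud-native service.

def build_request(base_url, source, table, token):
    """Construct a standard authenticated request for any data source."""
    return request.Request(
        f"{base_url}/v1/sources/{source}/tables/{table}/records",
        headers={"Authorization": f"Bearer {token}"},
    )

def fetch_records(base_url, source, table, token):
    """Fetch and decode a dataset through the uniform API."""
    with request.urlopen(build_request(base_url, source, table, token)) as resp:
        return json.load(resp)
```

The value is in the uniformity: swapping `source` from an ERP system to a SaaS app changes the path, not the calling code.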

Data privacy and compliance: Regulatory compliance cannot be an afterthought. Today, companies must pursue responsible innovation by securing data used in analytics and training AI models. Enterprise data needs to be masked to keep personally identifiable and other sensitive information from reaching the wrong hands.
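A minimal illustration of masking before data reaches analytics or model training (field names are illustrative, and real deployments should use vetted masking tooling, salts or keys, and formal policy): sensitive fields are replaced with deterministic, irreversible tokens, so records still join consistently while raw identities stay out of reach.

```python
import hashlib

# Minimal masking sketch: replace PII with a deterministic token so
# joins across datasets still line up, but raw identities never reach
# analytics or training. Field names are illustrative; production
# masking needs salts/keys and vetted tooling.

PII_FIELDS = {"name", "email", "ssn"}

def mask_record(record, pii_fields=PII_FIELDS):
    masked = {}
    for key, value in record.items():
        if key in pii_fields:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            masked[key] = f"tok_{digest}"  # irreversible, repeatable token
        else:
            masked[key] = value
    return masked
```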

Data history and integrity: Data changes quickly over time, and data relationships matter. When feeding data from different sources into machine learning, it’s critical to make sure all the data comes from precisely the same time to preserve relationship integrity. In addition, historical data can be used to iteratively train and test machine learning models to tune and improve outcomes.
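The same-point-in-time requirement can be sketched as an “as of” lookup (the key/timestamp/value record shape here is an assumption for illustration): each source is read at one shared timestamp, so cross-source relationships stay intact.

```python
# Sketch of point-in-time ("as of") reads, so data pulled from
# different systems reflects one consistent moment. The record shape
# (key / ts / value) is an assumption for illustration.

def as_of(history, ts):
    """Return the latest value of each key at or before timestamp ts."""
    snapshot = {}
    for row in sorted(history, key=lambda r: r["ts"]):
        if row["ts"] <= ts:
            snapshot[row["key"]] = row["value"]  # later rows overwrite earlier
    return snapshot

orders = [
    {"key": "o1", "ts": 1, "value": "placed"},
    {"key": "o1", "ts": 5, "value": "shipped"},
    {"key": "o2", "ts": 4, "value": "placed"},
]
snapshot_t3 = as_of(orders, 3)  # the orders table as it stood at t=3
```

Reading every source with the same `ts` is what keeps relationships between them consistent.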

Version control: The world continues to change, so models that work today may fail tomorrow. That means we need version control: ready access to the source data used to train a failing model, so we can run drift analysis, see what has changed in the data, and properly retune and retrain.
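A deliberately naive drift check, for illustration only (real drift analysis would use established metrics such as population stability index or Kolmogorov-Smirnov tests): flag features whose current mean has moved more than a few training standard deviations.

```python
from statistics import mean, pstdev

# Naive drift check for illustration only; production systems should
# use proper drift metrics (e.g. PSI, KS tests). Flags features whose
# current mean has shifted by more than `threshold` training
# standard deviations.

def drifted_features(train, current, threshold=2.0):
    flagged = []
    for feat, values in train.items():
        mu, sigma = mean(values), pstdev(values)
        shift = abs(mean(current[feat]) - mu)
        # A constant training feature (sigma == 0) drifts on any shift.
        if (sigma == 0 and shift > 0) or (sigma > 0 and shift / sigma > threshold):
            flagged.append(feat)
    return flagged
```

The point of keeping the versioned training data is that `train` here must be the exact data the failing model saw, not a reconstruction.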

Automation: While machine learning is sophisticated and high tech, most of the daily work is manual and prosaic—data wrangling, preparation, cleanup, and separating datasets for training, testing, and validation. All of these operations add friction to the process. Automation overcomes that friction, enabling faster and more effective use of data.
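One small example of automating a prosaic step (the proportions and seed below are arbitrary choices for illustration): a reproducible train/validation/test split that behaves identically on every run.

```python
import random

# Sketch of automating one prosaic step: a reproducible
# train/validation/test split. Proportions and seed are illustrative.

def split_dataset(rows, seed=42, train=0.7, val=0.15):
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed -> identical split every run
    n = len(rows)
    n_train, n_val = int(n * train), int(n * val)
    return (
        rows[:n_train],                  # training set
        rows[n_train:n_train + n_val],   # validation set
        rows[n_train + n_val:],          # held-out test set
    )
```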

Data anywhere: Today, enterprise applications live across the multi-cloud: SaaS, private clouds, and public clouds. And cloud vendors are constantly evolving and competing on AI technology offerings. So it’s critical for companies to be able to sync compliant data to wherever they can best process it for strategic advantage.

These seven guidelines don’t spell the end of the data warehouse. There are times when we know exactly the data we need now and in the future. In those situations, predetermination still holds value.

But in the land of tech giants and AI, we need a new model to keep up with the data and the times.

About the author: Jedidiah Yueh has led two waves of disruption in data management, first as founding CEO of Avamar (sold to EMC in 2006 for $165M), which pioneered data de-duplication and shipped one of the leading products in data backup and recovery, with over 20,000 customers and $5B in cumulative sales. After Avamar, Jed founded Delphix, which provides an API-first data platform to accelerate digital transformation for over 25% of the Global 100 and has surpassed $100 million in ARR. In 2013, the San Francisco Business Times named Jed CEO of the Year. Jed is the bestselling author of Disrupt or Die, a book that refutes conventional ideas on innovation with proven frameworks from Silicon Valley. After being designated a US Presidential Scholar by George H. W. Bush, Jed graduated Phi Beta Kappa, magna cum laude from Harvard, while working three jobs, including teaching at a local high school.

Related Items:

Why You Need Data Transformation in Machine Learning

Three Ways Biased Data Can Ruin Your ML Models

Automating the Pain Out of Big Data Transformation