August 9, 2023

Three Critical Factors to Consider When Preparing Data for Generative AI

Will Freiberg


Thanks in part to the excitement around breakthrough generative artificial intelligence (AI) tools like ChatGPT, industry analysts are projecting rapid growth of business investment in AI and machine learning (ML) technologies. IDC predicts spending this year will reach $154 billion, which is nearly 27% more than last year’s investment in AI/ML-related hardware, software, and services.

Keep in mind there’s a reason the organizations building generative AI tools are backed by deep-pocketed investors, have access to enormous datasets, and use exceptionally mature data management practices. The costs to train a large language model from the ground up would be prohibitive for most businesses. As explained in this “State of GPT” video from Microsoft, it’s an incredibly complex process that requires the investment of millions of dollars.

Most businesses that are assessing their data for AI/ML readiness will therefore be looking at ways to fine-tune a base model that already exists. For example, in the context of generative AI and language models, a company that wanted to fine-tune a model would need to invest time and resources in evaluating training data in specific formats and iterate continuously to align the data with its preferred narrative. This would require clean source data to be fed into the language model.

There are three critical factors about data that companies should consider when preparing for an AI/ML initiative. Those leading the project should also ensure from the outset that everyone involved is clear on the objectives and understands the processes and standards required. Here’s a closer look.


Three Factors to Save Time and Streamline Data Assessment

Data projects are typically complex, and since industry use cases vary significantly and each organization has internal idiosyncrasies and data maturity levels to consider, the task of assessing data can be a convoluted one. But here are three factors that should not be overlooked:

  1. Data accessibility: A common challenge companies encounter is data that is inaccessible because it is scattered across multiple, disparate systems or stored in a variety of incompatible formats. This scenario often occurs when companies grow through mergers and acquisitions, so information may be stored in multiple clouds and managed via different architectures. As a result, aggregating and standardizing into a single format becomes a daunting task, hindering the ability to effectively leverage the data for ML scaling.
  2. Data quality: The rise of domain-specific generative AI has highlighted the importance of having high-quality, curated data. The “garbage in, garbage out” axiom applies in AI/ML projects, and trouble can arise when businesses are pulling data from systems that weren’t designed for analytics. To shape data for analytics, project leaders may have to blend it with data from other sources. The blended data must then be monitored over time to ensure it remains valid and to avoid “data drift” or “model drift,” where the data the AI/ML tool was trained on no longer mirrors reality for the model’s purpose. Curating and maintaining high-quality data is crucial to ensure accurate and reliable AI/ML outcomes.
  3. Data quantity: Related to point #2, businesses frequently augment internal data with data from a variety of outside sources, including data offered by vendors and royalty-free public information. Quality and frequency issues can be a challenge when building data quantity from third-party sources, which might deliver data with time gaps or in different formats. Data from external sources also has to be transformed into a standard format and observed on an ongoing basis to ensure it remains fresh, usable, and relevant to the AI/ML initiative.
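
As a rough illustration of the monitoring described in points #2 and #3, the Python sketch below flags time gaps in third-party deliveries and computes a simple drift signal by comparing a recent sample against a baseline. The function names, sample values, and the review threshold are all hypothetical and chosen only for illustration; a real project would likely rely on a dedicated data observability tool and a more robust statistical test.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def find_time_gaps(delivery_dates, expected_interval=timedelta(days=1)):
    """Return (previous, next) pairs whose spacing exceeds the expected interval."""
    ordered = sorted(delivery_dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected_interval]


def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        return 0.0 if mean(recent) == mean(baseline) else float("inf")
    return abs(mean(recent) - mean(baseline)) / spread


# Hypothetical sample data, for illustration only.
deliveries = [date(2023, 8, 1), date(2023, 8, 2), date(2023, 8, 5)]
gaps = find_time_gaps(deliveries)  # one gap: Aug 2 to Aug 5

baseline_values = [10.0, 11.0, 9.0, 10.0, 10.0]  # values seen at training time
recent_values = [14.0, 15.0, 13.0, 14.0, 14.0]   # values arriving now
score = drift_score(baseline_values, recent_values)

# The threshold of 3.0 is an arbitrary assumption, not a recommended value.
needs_review = score > 3.0
```

A mean-shift score like this is only one of many possible drift signals; it is used here because it is easy to read, not because it is the best choice for any particular dataset.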

Data integration tools can be helpful in pulling information into a single data warehouse so project teams can start shaping it. It’s also critical to consider the regulatory implications of where the data is stored and which standards are applied, since jurisdictions have different rules.

Working Toward a Successful AI/ML Data Project


Gartner predicts that through 2025, 80% of businesses that attempt to scale their digital operations will fail due to a lack of modern data governance standards. To avoid a data misfire on an AI/ML project, it’s critical to define the objective and gain buy-in across the organization, setting clear goals for the program and creating consensus on value from the middle-management layers of the organization. Everyone must understand what the company will gain and how the project will benefit not only top management but all stakeholders across the organization.

It’s also crucial to assess data quality specifically for AI/ML project suitability. The fundamental question is whether the data not only has core quality attributes that are necessary for any analytics project but is also sufficiently complete, accurate, timely, etc., for use in training the model. From a data discovery perspective, project leaders may find data catalogs internally and externally that list the data type, but the information also has to be in a format that works for downstream users.
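
To make the completeness and timeliness questions concrete, here is a minimal Python sketch that scores a small batch of records on both attributes. The record layout, field names such as `updated`, and the 30-day freshness window are assumptions made for this example. Accuracy is left out because checking it requires a trusted reference to compare against.

```python
from datetime import date


def quality_report(records, required_fields, as_of, max_age_days):
    """Return the share of records that are complete and the share that are fresh."""
    total = len(records)
    # A record is complete when every required field is present and non-empty.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    # A record is fresh when its "updated" date falls within the allowed age.
    fresh = sum(
        r.get("updated") is not None and (as_of - r["updated"]).days <= max_age_days
        for r in records
    )
    return {
        "completeness": complete / total if total else 0.0,
        "timeliness": fresh / total if total else 0.0,
    }


# Hypothetical records, for illustration only.
sample = [
    {"id": 1, "price": 9.5, "updated": date(2023, 8, 1)},
    {"id": 2, "price": None, "updated": date(2023, 8, 8)},
    {"id": 3, "price": None, "updated": date(2023, 1, 1)},
    {"id": 4, "price": 7.0, "updated": date(2023, 8, 9)},
]
report = quality_report(
    sample, required_fields=["id", "price"], as_of=date(2023, 8, 9), max_age_days=30
)
```

With this sample, two of the four records are complete and three are fresh, so the report shows a completeness of 0.5 and a timeliness of 0.75. What counts as an acceptable score depends on the model and the use case.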

Another factor project leaders should consider is the availability of resources for projects of this scale. Skilled data engineers are in high demand, so for many businesses, it may make more sense to work with a partner instead of wasting precious cycles on lower-level data delivery and transformation tasks that can be a distraction from high-value analytics. An investment in data engineering tools that can automate the most manual and mundane tasks or a partnership with a data preparation expert can help businesses get to value faster with their AI/ML project.

Data projects are often a team sport because the more the business can focus on insights rather than the plumbing involved in delivering usable data, the more likely it is to achieve value quickly. That may be especially true for generative AI projects. The technology is exciting, but leveraging models for value also requires intensive human oversight.

About the author: Will Freiberg is a technology executive and entrepreneurial leader with significant cross-functional expertise across sales, product, business development, customer success, and strategic initiatives. He currently serves as CEO of Crux, a cloud-based data integration, transformation, and operations platform that accelerates the value realization between external and internal data. Prior to Crux, Will was Co-CEO at D2iQ (formerly Mesosphere). During his six-year tenure at D2iQ, he held various leadership positions and led the company through hypergrowth as it helped define the cloud-native container industry.
