November 8, 2019

Why You Need Data Transformation in Machine Learning

Damian Chan

Thanks to machine learning and advances in software and technology, enterprises can now process and understand their data much faster using modern tools with established algorithms. This allows them to deliver more powerful marketing campaigns, run more efficient logistics operations, and outpace competitors. But enterprise data can be convoluted and messy in its raw state. Some form of data transformation is therefore required before any analysis can support business use cases like the ones mentioned above.

Simply put, data transformation makes your data useful. Data transformation is the process in which you take data from its raw, siloed and normalized source state and transform it into data that’s joined together, dimensionally modeled, de-normalized, and ready for analysis. Without the right technology stack in place, data transformation can be time-consuming, expensive, and tedious. Nevertheless, transforming your data improves data quality, which is essential for accurate analysis and, in turn, for the insights that drive data-driven decisions.

Building and training models to process data is a brilliant concept, and more enterprises have adopted, or plan to deploy, machine learning to handle many practical applications. But for models to learn from data and make valuable predictions, the data itself must be organized so that its analysis yields valuable insights.

Garbage In, Garbage Out

Both artificial intelligence and machine learning business use cases need vast amounts of data to train the algorithms. For the most accurate results – the ones that you want to base insights-driven decisions on – that data needs to be in an analytics-ready state. The data should be joined together, of the highest quality, and embellished with appropriate metrics that the algorithms can use.

When it comes to machine learning, you need to feed your models good data to get great insights, and in most cases, some sort of data cleansing needs to be performed prior to any data analysis. This is a critical step as it ensures data quality, which increases the accuracy of predictions.

As volumes and sources of data increase, and high-powered computing becomes more affordable, large datasets can be used to train algorithms and generate predictions. Artificial intelligence uses learnings from data to make a computer or technology stack more human, allowing tasks to be automated without human intervention. Organizations in various industries can improve on automated tasks in real time. Machine learning and artificial intelligence are used for popular applications, including identifying financial fraud, spotting opportunities for investments and trade, driverless cars, speech recognition, robotics, and improving customer service.

To deliver on the promise of machine learning and artificial intelligence alike, models need to consume clean datasets while keeping up with new incoming data. Make sure to look for outliers in your datasets, as these can skew the output of your jobs. Without checking the quality of your datasets, you won’t get an accurate result from the machine learning job, and that will make it difficult to make good business decisions.
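
One common way to look for outliers is the interquartile range (IQR) rule. The sketch below is only an illustration, not a method prescribed here: the function name, the sample values, and the conventional multiplier of 1.5 are all assumptions, and it uses nothing beyond the Python standard library.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical order totals; the last value is a likely data-entry error.
order_totals = [52.0, 48.5, 50.1, 49.9, 51.3, 47.8, 50.6, 980.0]
print(iqr_outliers(order_totals))
```

Whether a flagged value should be removed, capped, or kept depends on the dataset: a genuine but rare event is not the same thing as a typo.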

The challenge to enterprises, therefore, is to transform their data, even as their data increases in volume, variety, and velocity. The cloud has fundamentally altered the way businesses store, manage, and use their data. To overcome this challenge and unlock the potential of big data, a business should fully leverage the power of the cloud, and consider deploying data transformation purpose-built for the cloud.

But First, Data Transformation

Before data can be processed within machine learning models, there are certain data transformation steps that must be performed.


  • Remove unused and repeated columns – handpicking the data you need will improve the speed at which your model trains, as well as your analysis.
  • Change data types – using the correct data types reduces memory usage, and can be a requirement for calculations to be performed against the data, such as storing numerical values as integers rather than text.
  • Handle missing data – resolving incomplete data can vary depending on the dataset. If a missing value doesn’t render its associated data useless then you may want to consider imputation – the process of replacing the missing value with a simple placeholder, or another value, based on an assumption. If your dataset is large enough, you can likely remove the data without incurring a substantial loss to your statistical power. Proceed with caution as you may inadvertently create a bias in your model, but not treating the missing data can also skew your results.
  • Remove string formatting and non-alphanumeric characters – remove characters like line breaks, carriage returns, whitespace at the beginning and end of values, currency symbols, etc. Also, consider word-stemming. While removing formatting and other characters makes a sentence less readable for humans, this approach helps an algorithm better digest the data.
  • Convert categorical data to numerical – many machine learning models require categorical data to be in a numerical format, requiring conversion of values such as yes or no to 1 or 0. Be careful not to impose an order on unordered categories, for example by converting mr, miss, and mrs to 1, 2, and 3.
  • Convert timestamps – timestamps come in all types of formats; it’s a good idea to define a date/time format and convert all timestamps to the defined format.
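
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production pipeline: the records, column names, list of accepted date formats, and the choice of median imputation are all assumptions made for the example. In practice these steps are often run inside a data warehouse or with a dataframe library.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records; column names and values are invented.
raw = [
    {"id": "1", "title": "mr", "spend": " $1,200.50\n",
     "signup": "2019-11-08", "subscribed": "yes", "notes": "n/a"},
    {"id": "2", "title": "mrs", "spend": "",
     "signup": "08/11/2019", "subscribed": "no", "notes": "n/a"},
    {"id": "3", "title": "miss", "spend": "$300",
     "signup": "20191108", "subscribed": "yes", "notes": "n/a"},
]

KEEP = ("id", "title", "spend", "signup", "subscribed")  # "notes" is unused
TITLES = ("mr", "mrs", "miss")
# Assumed input formats; day-first is assumed for slash-separated dates.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def parse_spend(text):
    """Strip whitespace, currency symbols and separators; None if empty."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned) if cleaned else None

def parse_date(text):
    """Try each assumed format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def transform(records):
    # Remove unused columns by copying only the ones we keep.
    rows = [{k: r[k] for k in KEEP} for r in records]
    for row in rows:
        row["id"] = int(row["id"])                  # change data types
        row["spend"] = parse_spend(row["spend"])    # remove string formatting
        row["signup"] = parse_date(row["signup"])   # convert timestamps
        row["subscribed"] = 1 if row["subscribed"] == "yes" else 0
        # One-hot encode the title so no artificial order is implied.
        title = row.pop("title")
        for t in TITLES:
            row[f"title_{t}"] = int(title == t)
    # Handle missing data: impute the median of the observed values.
    fill = median(r["spend"] for r in rows if r["spend"] is not None)
    for row in rows:
        if row["spend"] is None:
            row["spend"] = fill
    return rows

for row in transform(raw):
    print(row)
```

One-hot encoding gives each title its own 0/1 column instead of mapping mr, miss, and mrs to 1, 2, and 3, which avoids implying an order. Median imputation is only one option; as noted above, any imputation rests on an assumption and can introduce bias.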

Actionable Insights Courtesy of Machine Learning

Machine learning can help your business process and understand data insights faster – empowering data-driven decisions to be made across your organization. With the advances in technology and the power of cloud computing, almost every business can take advantage of machine learning in a cost-effective and agile manner without sacrificing speed and performance.

As the quality of your data increases, you can expect the quality of your insights to increase as well. Transforming data for analysis can be challenging given the growing volume, variety, and velocity of big data, but it is worth it as businesses continue to use data and insights to innovate and grow. This challenge will need to be overcome to unlock the potential of your data and to mobilize your business to move faster and outpace competitors.

About the author: Damian Chan is an experienced data engineer and finance enthusiast with a passion for big data. Damian serves as a solutions engineer at Matillion, a provider of data transformation software for cloud data warehouses. His previous professional work includes building algorithmic systems for Seer Trading Systems where he was exposed to the stock, commodities, and foreign currency exchange market. He has led big data ingestion and deployment and is proficient in cloud data warehouse technologies. 
