Follow Datanami:
June 22, 2020

Solving the Challenge of Integrating Messy Data

The Challenge

Your team manages several analytics projects each with different requirements, data sources, data structures, and other peculiarities. For each new project, you kick off the process of gathering the data, getting to know the formats and accessing the contents of the data, preparing that data for analysis by cleaning and standardizing problem areas, validating the output, and then uploading that data to your analytics platform to provide valuable insights back to the stakeholder. Far and away the biggest challenge throughout this process of integrating data, is understanding and preparing the messy data. Not only is this process painful and inefficient but it’s also the single biggest limiting factor to scaling your operations and becoming a more efficient team. Sound familiar?

Many teams rely on Excel to do this work. The data comes in as files or extracts provided from a customer or IT team and then your team opens the various datasets in Excel. This begins the time-consuming process of scrolling  through the contents of the data, identifying data quality issues, understanding the columns and column formats available, and then starting to blend, clean, and reformat that data to the exact spec needed for your analytics. After doing this several times with the same data sources, you start to get comfortable with that particular data; but it still requires a manual and tedious process each time you need to update the analytics with the new batch of data. Larger files are particularly frustrating as each new step in Excel takes seconds or minutes to load, or crashes the whole application.

Additionally, this requires your team to go through the same preparation steps each time you want to refresh the analytics. It’s extremely time consuming to onboard this data the first time, but you only save a little bit of time through familiarity when you have to repeat this process over and over again. Many teams who rely on Excel have to dedicate a large percentage of each of their team members’ work days to repeatedly preparing data in Excel. This manual process has led many analytics providers to seek tools to automate this process. Most common of which are programming tools like Python or using a combination of SQL and a scheduler. For each new customer, these coding platforms allow your team to build a script that can then be automated and used for each new batch of data. Trouble is, Python and SQL require a lot of technical expertise, and can be difficult even for seasoned users to understand the contents of the data, especially when exploring a messy dataset for the first time. You might think to pass this scripting work off to the engineering team, but they are already busy with software development and cannot possibly hit the time tables you require.

So, what do you do?

Trifacta provides the ease of use of tools like Excel paired with the automation capabilities of tools like Python–with added visualizations and interactive data quality information to make discovering the contents of one’s data quick and easy. Each step created in Trifacta during the preparation process is stored in a recipe, which can be scheduled to create a new self-service data pipeline for each project. Trifacta accelerates the time it takes to discover the contents of each dataset, prepare that data, and publish that data to the analytics platform, but the real value and time savings comes from automating the work done in Trifacta. Once a recipe is created to take data from raw to refined, that recipe can be scheduled to run hourly, daily, weekly, or whatever time interval is needed to ensure that your customers are getting up to date analytics whenever they need it.

Interested in trying Trifacta for yourself? Sign up for free today.