June 15, 2020

The Data School with Professor Joe Hellerstein: The Role of AI in Data Prep

 

What is AI’s role in the data preparation process? It doesn’t take much more than asking Siri to clean your data to realize that we can’t sit back and let AI take care of all of our data preparation needs. In session 3 of the Data School with Professor Joe Hellerstein, Joe looks at how human intelligence and artificial intelligence can work together to make data cleaning easy and intuitive.

Here’s what we do know: traditional data transformation interfaces are incredibly tedious to work with, asking users to write complex code or use IT power tools to work with data in a scalable manner. At the other end of the spectrum, spreadsheets provide an easy-to-use interface but lack scale and governance.

There must be something between “Siri, clean my data” and 600 lines of Python code to solve this problem. We want a medium in which people can naturally immerse themselves in their data. Speech is probably not that medium, but what about the visual medium? Things like spreadsheets and dashboards are familiar to those who work with data. The visual medium provides a great foundation for seeing how your data changes as you clean, structure, and blend it, and it also gives you visual cues to interact with. This interaction is exactly where AI comes into the equation.

Let’s take an example. You have a date column with multiple formats: some rows resemble 03/17/15 and others look like 17-Nov-2017. You only noticed this after loading your data into your favorite visualization tool and seeing that a large group of rows in your date column show up as null. Using a traditional data transformation interface, in the best case you can select an edit-column block and then create a complex if/then or case statement using regular expressions. That statement has to identify conditions such as this one: the first two digits are the day, followed by a “-”, followed by a three-letter string, followed by another “-”, followed by a four-digit year. You then reformat the match to a different pattern that you specify using more regular expressions. Once you have that in place, you have reformatted one of the date formats in your column. What happens when you have four different formats? There goes your Wednesday on just this one task.
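To make the manual approach concrete, here is a minimal Python sketch of what that hand-written logic might look like for just the two formats mentioned above. The function name `normalize_date`, the `FORMATS` table, and the choice of ISO 8601 as the target format are illustrative assumptions, as is reading 03/17/15 as month/day/year. It is not taken from any particular tool.

```python
import re
from datetime import datetime

# Hypothetical table: one regular expression plus one strptime format
# per date layout. Every new layout in the column needs another entry.
FORMATS = [
    # e.g. 03/17/15, assumed here to mean month/day/two-digit year
    (re.compile(r"^\d{2}/\d{2}/\d{2}$"), "%m/%d/%y"),
    # e.g. 17-Nov-2017: day, three-letter month, four-digit year
    # (%b matches English month abbreviations in the default C locale)
    (re.compile(r"^\d{2}-[A-Za-z]{3}-\d{4}$"), "%d-%b-%Y"),
]

def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    text = value.strip()
    for pattern, fmt in FORMATS:
        if pattern.match(text):
            try:
                return datetime.strptime(text, fmt).date().isoformat()
            except ValueError:
                # Right shape but not a real date, e.g. 99/99/99
                return None
    return None
```

Even in this small sketch, every additional format means another pattern, another format string, and another round of testing, which is where the hours go.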

What happens if we add some visualizations and AI to solve this problem? We can quickly identify when a column contains multiple formats, which would cause issues in your analytics. Simply clicking on the column that has funky data, flagged by a visual indicator, lets the AI provide a ranked list of suggestions for resolving the issue. You can clean up mismatched dates with just a couple of clicks rather than lines and lines of complex code, saving hours of time and frustration.
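As a rough illustration of the first step, spotting that a column mixes formats, the sketch below reduces each value to a "shape" (digits become 9, letters become A) and ranks the shapes by frequency. This is a simple heuristic written for this example, with the hypothetical helper names `shape` and `rank_formats`. It is not a description of how Trifacta's suggestion engine works.

```python
import re
from collections import Counter

def shape(value):
    """Reduce a value to its layout: letters become 'A', digits become '9'."""
    text = re.sub(r"[A-Za-z]", "A", value.strip())
    return re.sub(r"\d", "9", text)

def rank_formats(values):
    """Return (shape, count) pairs, most common shape first."""
    counts = Counter(shape(v) for v in values)
    return counts.most_common()

# Example column mixing the two layouts from the article
column = ["03/17/15", "04/02/16", "17-Nov-2017"]
ranked = rank_formats(column)
```

A column that yields more than one shape is a candidate for the kind of visual flag described above, and the minority shapes are the natural first targets for a suggested fix.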


This is the problem Trifacta is interested in solving: pairing human intelligence with artificial intelligence to make understanding your data, cleaning your data, blending your data, and all of the nitty-gritty tasks of wrangling your data easy and intuitive. Make sure to watch the full video above to see how AI can significantly improve the experience of cleaning your data!

Datanami