November 11, 2019

What is Data Wrangling?

To start to answer this question, let's consider the high-level objective of most data professionals: take data close to the source and turn it into value. This value can be realized in a few ways. Data can drive important business decisions, like choosing which markets to advertise in, or it can feed data-driven systems that provide better product experiences, like recommending shows to watch or people to connect with. Data hardly ever comes ready for use right out of the box, however. Getting value out of downstream applications requires taking disparate sources of raw data, discovering and assessing their content, combining them with other insight-rich sources, structuring and cleaning the data for accuracy and consistency, and automating and orchestrating this process for continuous, timely value. This process is exactly what we mean by Data Wrangling.

It is well known that wrangling data accounts for over 80% of the time spent on most data projects. From a sheer time-savings perspective, this is where companies can gain the biggest competitive advantage. Considering how important quality data is to analysis and machine learning only increases the urgency of establishing successful data wrangling practices.
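To make these steps concrete, here is a minimal sketch of a wrangling pass in pandas, using small inline DataFrames in place of real sources (all column names and values here are hypothetical): clean and structure one raw extract, then combine it with a second source to derive value.

```python
import pandas as pd

# Hypothetical raw "transactions" extract: inconsistent casing, exact
# duplicates, and a missing region -- typical of data close to the source.
transactions = pd.DataFrame({
    "customer": ["Ada", "ada", "Grace", "Grace", "Linus"],
    "region":   ["EMEA", "emea", "AMER", "AMER", None],
    "amount":   [120.0, 120.0, 85.5, 85.5, 40.0],
})

# A second, insight-rich source to combine with: region-level targets.
targets = pd.DataFrame({
    "region": ["EMEA", "AMER"],
    "target": [100.0, 90.0],
})

# Structure and clean: normalize casing, drop exact duplicates, and fill
# the missing region with a sentinel so it can be flagged for review.
clean = (
    transactions
    .assign(
        customer=lambda d: d["customer"].str.title(),
        region=lambda d: d["region"].str.upper(),
    )
    .drop_duplicates()
    .fillna({"region": "UNKNOWN"})
)

# Combine with the second source and derive a simple downstream metric.
enriched = clean.merge(targets, on="region", how="left")
enriched["over_target"] = enriched["amount"] > enriched["target"]
```

After cleaning, the two duplicated rows collapse away and the left join leaves the unmatched "UNKNOWN" region with a missing target, which comparison semantics treat as not over target. In practice each of these steps also needs the automation and orchestration the article describes to deliver continuous value.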

What makes this process so difficult? 

For starters, the common approach is a continuation of a decades-old way of using data. If we were to take a snapshot of an organization from the 1990s, chances are the approach would look as follows. IT teams managed siloed-off data centers containing highly structured, transactional data. Business teams were tasked with analyzing commercial and operational efficiency and performance. When business teams needed data for their analyses, they would send a spec to IT; IT would log it in the requirements queue and return with data a few weeks later. If the data met the requirements, great: some form of ETL process that curated that data for analysis could be locked down into production. If not, which was more often the case, this back-and-forth process of trading specs for prototypes would continue.

Jump ahead 30 years and everything has changed… Well, almost everything. The volume and variety of data have exploded. Data has shifted from transactions to interactions. Cloud platforms now allow for maximum scalability with low-cost storage, meaning organizations can store large volumes of raw data, in varying formats, in cloud data lakes and in agile, scalable cloud data warehouses. Advances in open source technology and algorithms mean organizations have far more insight into customer behavior and greater capabilities for deriving value from data. Yet some estimate that as little as 1% of today's data gets analyzed. The decades-old approach to wrangling data simply cannot keep up with the changing paradigm.

Where do we go from here?

To keep up with this rapidly changing landscape, organizations need to adopt a strategy focused on agility and self-service. Rather than silo off an IT unit to create and maintain ETL pipelines that are rigid and slow to adapt to business needs, organizations should embrace technologies that empower the line of business to get their hands on the data they know best. Modern Data Wrangling platforms like Trifacta enable end users to connect to data close to the source, refine it for analysis, and collaborate with data engineers and IT teams to automate and orchestrate data pipelines for continuous value. Pairing visual and machine learning guidance, their code-free interfaces give users immediate clarity on the contents of the data, guidance on creating preparation steps, continuous validation at each step of the way, and the ability to operationalize their work in a cloud-native platform that interoperates with other tools in the stack.

Try for yourself today with a free trial of Trifacta.