June 11, 2015

Can Smarter Machines End the Pain and Expense of Data Wrangling?

Sharmila Mulligan

(agsandrew/Shutterstock)

Just as Alan Turing envisioned smarter machines to crack the Enigma code in World War II, we now sit at a critical juncture where smarter machines that blend and prepare big data could relieve the significant pain and expense of data wrangling that most big companies face. Up to 80 percent of data experts’ time is spent converting data into a usable form rather than on modeling, analysis and collaboration. And as with the conundrum Turing and his team faced, the data is always changing.

Many think that machine learning is the next big frontier in data intelligence. Some of the world’s largest software and data intelligence companies, including Amazon, Google and Microsoft, aim to solve data wrangling and speed time to insights with smarter machines. In early April, Amazon Web Services announced a new service for machine learning that’s smart enough to make predictions. Amazon claims that it now takes only 20 minutes to solve one problem that previously took 45 days. Microsoft’s Azure Machine Learning cloud service just went into general availability in February.

Data Wrangling Is Costly and Complex

First, the time and effort sunk into data wrangling result in major delays to reaching business answers. Worse yet, the answers typically aren’t what the business expected. Complex, lengthy cycles with underwhelming results just don’t cut it in competitive business environments.

Second, the data variety and data volumes that organizations want to tap into have become too complex, and the data itself changes too fast. Relying on people and expert skills to wrangle, blend, and model data is an insurmountable problem that cannot be solved by continuing to throw more people at it.

There’s a strong business need to create smarter machines that automate data preparation and blending. Breakthroughs in the computer science field promise to save companies billions of dollars through less time spent on data wrangling and much faster time-to-insights to steer the business forward.

A Blended Man-Machine Approach Emerges

The solution to the costly and widespread data wrangling problem is a blended approach of man and machine.

Fully automating data modeling and blending with smarter machines requires artificial intelligence that simply doesn’t exist today. The reality is that embedded machine learning on its own won’t succeed here because data is unpredictable and complex. It requires human expertise and curation to get to the business insights needed by organizations.

Mixed in with human expertise, smarter data machines can solve about 70 percent of the data wrangling problem. Here’s a high-level overview of that smarter machine:

  1. A smart data inference engine, coupled with smart data blending, must be able to read data attributes, assign probabilities to the accuracy of what it interprets in the data, and use all this metadata to prep and blend data without needing humans to do it.
  2. Machines’ ability to learn patterns and infer which data sets are similar enough to be blended together can be automated to a certain degree, requiring no human intervention. The more data you run through a smart machine, the more likely it is to recognize patterns and assign probabilities for the best data fits, yielding smarter auto mashups.
  3. To do this at breakneck speed and enterprise scale, the data inference and blending algorithms require a modern, fast-cycle data processing engine like Apache Spark.
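
To make the first two points concrete, here is a minimal, illustrative sketch in plain Python. It is not how any particular product works: the regex detectors, the function names (`infer_type`, `overlap`, `blend_candidates`), the 0.5 threshold and the sample tables are all assumptions made up for this example. It guesses a type for each column along with the share of values that support the guess, then proposes column pairs from two tables that look similar enough to blend.

```python
import re
from collections import Counter

# Hypothetical attribute detectors; a real engine would use far richer signals.
PATTERNS = {
    "integer": re.compile(r"^-?\d+$"),
    "decimal": re.compile(r"^-?\d+\.\d+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def infer_type(values):
    """Return (best_type, probability) for a column of raw strings."""
    if not values:
        return "unknown", 0.0
    counts = Counter()
    for v in values:
        label = next(
            (name for name, pat in PATTERNS.items() if pat.match(v.strip())),
            "text",  # fallback when no detector matches
        )
        counts[label] += 1
    best, hits = counts.most_common(1)[0]
    # Probability here is simply the fraction of values matching the winner.
    return best, hits / len(values)

def overlap(a, b):
    """Jaccard similarity of the distinct values in two columns."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def blend_candidates(left, right, threshold=0.5):
    """Suggest (left_col, right_col, score) pairs that look joinable.

    Each table is a dict mapping column name -> list of raw string values.
    """
    suggestions = []
    for lname, lvals in left.items():
        ltype, _ = infer_type(lvals)
        for rname, rvals in right.items():
            rtype, _ = infer_type(rvals)
            score = overlap(lvals, rvals)
            if ltype == rtype and score >= threshold:
                suggestions.append((lname, rname, round(score, 2)))
    return sorted(suggestions, key=lambda s: -s[2])

# Made-up sample data in the spirit of the beverage example.
sales = {"sku": ["A1", "B2", "C3"], "units": ["10", "20", "x"]}
promo = {"item_code": ["A1", "B2", "D4"], "discount": ["0.10", "0.25", "0.15"]}

print(infer_type(sales["units"]))      # ('integer', 0.666...): one value is dirty
print(blend_candidates(sales, promo))  # [('sku', 'item_code', 0.5)]
```

The probability attached to each guess is what lets a machine decide when to proceed on its own and when to defer to a person. At enterprise scale the same idea would run on a distributed engine rather than in-memory Python lists.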

One use case that helps bring this to life is a beverage marketer’s conundrum of figuring out which packaging, promotion and pricing work best across various chains of retail stores across the country. This type of business analysis requires a massive blend of data from dozens of SKUs such as six packs, cases, two-liter bottles, restaurant syrups and more. A smart machine can learn which patterns to match up, automating data prep and blending to solve this complex problem quickly so the beverage marketer can optimize the sales and channel offerings based on data intelligence.

Smart Machines Can Ease Shortfall Of Data Scientists

Beyond the huge financial and business implications, industry pundits forecast a big shortfall in data expert talent across industries. A McKinsey study predicts that by 2018, the U.S. alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a shortage of 1.5 million skilled managers who know how to understand and make decisions based on analysis of big data.


Although a machine can’t solve 100 percent of the data wrangling and blending problem because data is unpredictable and complex, solving 70 percent of the problem through machine automation is a big leap forward.

Why 70 percent and not more? To start, it’s possible that automation will be most accurate on semi-structured and structured data, while ad hoc user-generated data (such as user inputs into web forms, where data is often inaccurate or incomplete) or highly unstructured data may require some pre-prep to shape the data into a machine-readable form. That’s not to say that ad hoc user-generated data or unstructured data doesn’t lend itself to more automation over time, but in the near term it presents the most challenges for accurate machine-readability.

Ultimately, a machine with semantic recognition capabilities can learn patterns with a high degree of accuracy as more data is ingested and processed, whether the data is semi-structured or structured. In the case of highly unstructured data, machines can be trained to recognize patterns and derivations of a pattern, but the first step in this scenario may involve some human intervention and expertise.

New Data Formats and Human Language Ambiguity Are Challenges

As Steve Lohr wrote in an article in The New York Times last year on the topic: “data formats are one challenge, but so is the ambiguity of human language.” Specifically, Lohr is referring to the fact that data will always contain custom, company-specific attributes that may not be recognizable immediately by a machine. Because of this, the first step may involve human intervention to confirm that what the machine inferred is accurate, or to correct it, before the machine can take over learning and managing the pattern.
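
That hand-off between person and machine can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `resolve_attribute`, the 0.9 confidence threshold, the `confirmed` dictionary and the `ask_human` callback are all hypothetical names invented here, not part of any real tool. Low-confidence inferences go to a reviewer once, and the confirmed answer is remembered so the machine can handle that attribute on its own afterward.

```python
def resolve_attribute(name, guess, confidence, confirmed, ask_human, threshold=0.9):
    """Decide what a column means, deferring to a person when unsure.

    name       -- raw column name as it appears in the data
    guess      -- the machine's inferred meaning for that column
    confidence -- the machine's probability that the guess is right (0..1)
    confirmed  -- dict of meanings a person has already confirmed or corrected
    ask_human  -- callback(name, guess) returning the reviewer's answer
    """
    if name in confirmed:
        # A person has already settled this; the machine reuses the answer.
        return confirmed[name]
    if confidence >= threshold:
        # Confident enough to proceed without review.
        return guess
    # Otherwise ask once, then remember the answer for next time.
    answer = ask_human(name, guess)
    confirmed[name] = answer
    return answer


# Example with a stand-in reviewer who corrects a company-specific field.
questions = []

def reviewer(name, guess):
    questions.append(name)
    return "customer_id"

memory = {}
first = resolve_attribute("cust_ref_x", "free_text", 0.4, memory, reviewer)
second = resolve_attribute("cust_ref_x", "free_text", 0.4, memory, reviewer)
print(first, second, len(questions))  # customer_id customer_id 1
```

The reviewer is consulted only on the first encounter; the second call is answered from memory, which is the sense in which the machine "takes over" the pattern.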

Machines can go a long way to automate what’s been the most painful and costly part of harnessing big data – by either learning trained patterns or using already-known attributes and patterns. Data wrangling is not a new problem. It’s plagued companies of all kinds for decades. But as data explodes in volume and diversity, the need for machines to automate data wrangling becomes urgent.

We’ll never hit 100 percent automation, even with great advances in data analysis and automation. New types of data generated by the devices of the future will create an ever-moving finish line that makes 100 percent automation unachievable. However, we’re definitely on our way and have made impressive strides, with a majority of the automation achievable through data-intelligent machines.

About the author: Sharmila Mulligan is the CEO and founder of ClearStory Data, a big data analytics startup based in Menlo Park, California. Sharmila has spent 18-plus years building game-changing software companies in a variety of markets. She has been EVP & CMO at numerous software companies, including Netscape, Kiva Software, AOL, Opsware, and Aster Data. She drove the creation of several multi-billion dollar market categories, including application servers, data center automation, and big data analytics. She is on the board of Hadapt and Lattice Engines, advisor to numerous companies, large and small, and an active investor in early stage companies.

 
