March 26, 2014

Forget the Algorithms and Start Cleaning Your Data

Alex Woodie

The idea that the combination of predictive algorithms and big data will change the world is a tempting one. And it may end up being true. But for right now, the industry is facing a reality check when it comes to big data analytics. Instead of focusing on what algorithms to use, your big data success depends more on how well you cleaned, integrated, and transformed your data.
The dirty little secret of big data analytics is all the work that goes into prepping and cleaning the data before it can be analyzed. You may have the sharpest data scientists on the planet writing the most advanced algorithms the universe has ever seen, but it won’t amount to a hill of beans if your data set is dirty, incomplete, or flawed. That’s why up to 80 percent of the time and effort in big data analytic projects is spent on cleaning, integrating, and transforming the data.

As the data scientist heading up Cloudera’s Oryx project to create a new machine learning framework, Sean Owen lives and breathes algorithms. But even Owen wonders if the industry’s fascination with algorithms has gone a bit too far.

“It’s interesting that a lot of the conversation amongst these startups seems to be who’s got more algorithms, who’s got the faster algorithms, who’s got the better accuracy on a given data set,” he tells Datanami. “These are all important in some context, but I think some of these are missing the more basic needs out there.”

Instead of focusing on algorithms, people ought to be focused on data. “Everybody basically has the same algorithms,” he says. “I have a hard believing there’s a fundamentally different new algorithm that’s going to come out and change the world…..Really, the differentiators from one company to the next is data–who has more data, who has better data. More better data plus a dumb algorithm is going to beat a really good algorithm on bad data.”


Cloudera’s Sean Owen

The folks at RapidMiner have a similar viewpoint. The German-born advanced analytics software provider that Gartner considers a credible threat to the dominance of SAS and IBM‘s SPSS in this field has about 1,500 algorithms in its product, which it claims is the most of any company in its field.

But only about 250 of RapidMiner’s algorithms are aimed at what we might consider “the good stuff”—analyzing data in useful and creative ways. The rest of the algorithms are there to help ease the burden of integrating, cleaning, and transforming the data prior to analysis.

“Algorithms are one thing, but the more important thing is usually how you’re transforming the data and preparing the data before it goes into a predictive model,” Ingo Mierswa, co-founder and CEO of RapidMiner, tells Datanami. “In my 15 years working in this field, I’ve found that it’s much more about finding the right way of preparing the data and making the right variations and explaining those models than the actual algorithms.”

Even the best algorithm will only move the needle a little bit, Mierswa says. “And the big breakthroughs, when it comes to solving a certain problem like churn or predictive maintenance or whatever the business problem is–it’s more of how you are including certain data sets and how are you transforming the data set in a way so it covers more information for the actual predictive algorithm,” he says.

Hadoop analytics vendor Datameer has discovered the same thing about algorithms. “What we’re finding is people are spending 80 percent of time just in getting that data ready, because the data is so raw, because there’s so much of it, and it’s in so many formats,” Datameer’s senior director of product marketing Karen Hsu tells Datanami. “The data is a mess and you have to clean it up quickly. The first step is finding where the problem in your data are, and that’s where profiling comes in.”

Today the company officially unveiled a new visual data profiling function in its version 4.0 release that allows analysts to see where their data might be a little out of whack before running it through one of four general algorithms Datameer provides, including clustering, column dependencies, decision trees, and recommendations. In addition to predictive analytics (which it unveiled last June), the company’s soup-to-nuts analytics stack covers data ingestion and visualization.


Datameer 4.0 gives users visual clues about the quality of their data

The new data profiling function gives users a visual representation of their data set from Datameer’s Excel-like interface. The software uses “visual checkpoints” to communicate information about the data to the user, including data type and counts, and various statistical measure, including maximum, minimum, cardinality, means, and average. The software enables the user to change the automatically make a change, and then inspect the data again before moving onto actual analysis using algorithms.

Hsu says the data profiling function gives customers some of what Trifacta and Paxata offer in their automated data transformation tool sets, but does so within a larger suite that also includes data integration, analytics, and visualization layers. “We’re finding customers don’t want to have 50 tools they have to work with,” she says. “A business analyst wants one environment that they can use for everything.”

By the way, RadidMiner this week announced that it has officially moved its headquarters from Germany to the Cambridge area of Massachusetts. Besides the proximity to top universities, other big data startups, and venture firms, just having a physical presence on U.S. soil will help the company with American prospects, Mierswa says.

Related Items:

Has Dirty Data Met Its Match?

SAS and IBM King of Analytics Hill, But for How Long?

Datameer Reels in $19M as Venture Funds Move Up the Hadoop Stack

Applications: Data Mining, Visualization

Technologies: Middleware

Sectors: Other

Vendors: Startups and More...

Tags: algorithms, data quality, Hadoop, visualization

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Forget the Algorithms and Start Cleaning Your Data

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

May 13, 2024

May 10, 2024

May 9, 2024

May 8, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Top 6 Strategies for Reducing Data Warehouse Costs

Building an Operational Data Warehouse for Real-time Analytics

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

AI & Big Data Expo North America 2024

CDAO Canada Public Sector 2024

AI Hardware & Edge AI Summit Europe

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Forget the Algorithms and Start Cleaning Your Data

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

May 13, 2024

May 10, 2024

May 9, 2024

May 8, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link