October 16, 2015

Avoiding the Pitfalls of Bigger Data at the Human-Machine Interface

One of the widely held misconceptions in the field of big data analytics is that you can scale your way to insights simply by adding more data. That may be true in some situations, but as FiveThirtyEight Editor in Chief Nate Silver and CrowdFlower CEO Lukas Biewald said at this week's inaugural Rich Data Summit, it's just not that simple.

Silver led off his keynote address at Wednesday’s Rich Data Summit with a reference to the 2008 story “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” written by Wired magazine editor Chris Anderson. Silver, the statistician and former baseball analyst who wrote the 2012 book “The Signal and the Noise,” had good things to say about Anderson in general, but not this particular story.

“This is the idea that it’s all about brute force and volume, that it’s a computational problem, that if you have millions or billions or trillions of observations and you have powerful enough computers then eventually correlations will rain down from the sky and you discover truth through brute force alone,” Silver said.

“This idea, I think, is kind of that big data is magic, in other words, and that it’s push-button, where you get your big data, you push a button, and all of a sudden you have extremely valuable output,” he said. “This idea is very wrong, and a little bit dangerous.”

It was ironic that, just after Anderson's story came out, we witnessed the near-collapse of the nation's financial system in the fall of 2008, when the effects of "a series of complex algorithms that the banks used" came home to roost, Silver said. "These are all examples of predictions that were all substantial failures of some kind," he said. "We're getting the little stuff right more and more, but we still have all types of issues."

Confusion in the Noise

The demand to use big data for a competitive advantage in commercial fields has never been greater, but all too often practitioners get distracted by noise in the data instead of concentrating on the signal, however faint it may be. As more data becomes available, there's a temptation to use all of it, adding more variables to the model in hopes of extracting more signal.

Nate Silver famously used analytics to successfully call the outcomes in 49 of the 50 states in the 2008 U.S. Presidential election Photo: Wikipedia

But when dozens or hundreds or thousands of signals are brought in–or hundreds of thousands, such as the 227,000 time series published in the Federal Reserve Economic Data (FRED) database–it becomes exceedingly difficult to isolate the signal and to detect or correct for any relationships that exist between any two sets of data, Silver says.

"I live in New York now, and we have a saying that if you're one in a million in New York, there are seven people just like you," he said. "It's the same in datasets that are this large, where you can find a 'statistically significant' correlation that has a one-in-a-million probability of being a fluke. But in a data set of this [FRED] volume, you have about 30,104 one-in-a-million lottery tickets for those, and if you're not careful, you could be staking your business on what is essentially a bug instead of a feature."
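The arithmetic behind Silver's warning is easy to reproduce: test enough pairs of unrelated series and some will clear any significance bar by chance alone. A small simulation of the effect (illustrative only, using synthetic random noise rather than actual FRED series, and a modest 5 percent significance threshold rather than one-in-a-million):

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, n_obs = 100, 50
data = rng.standard_normal((n_series, n_obs))  # 100 independent noise series

r = np.corrcoef(data)                 # 100 x 100 pairwise correlation matrix
iu = np.triu_indices(n_series, k=1)   # the 4,950 distinct pairs
r_crit = 0.279                        # approx. 5% two-sided threshold for n=50
spurious = int(np.sum(np.abs(r[iu]) > r_crit))

print(f"{len(iu[0])} pairs tested, {spurious} 'significant' by chance alone")
```

Even though every series here is pure noise, roughly 5 percent of the pairs (a couple of hundred) come back "statistically significant" — exactly the lottery-ticket problem Silver describes, just at a smaller scale than FRED's.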

Such warnings about the dangers of big data fallacies are nothing new, and indeed Silver has made a good living preaching the need to bring a Bayesian sensibility to such questions (as well as being the respected editor of FiveThirtyEight.com, which surely is a great gig).

But as the host of the Rich Data Summit, CrowdFlower founder and CEO Lukas Biewald, pointed out, data science starts with the data, and sometimes the size of the data can make a big difference.

Big Data Sirens

Biewald, who comes from a machine learning background and worked on the data science teams at Powerset and Yahoo Japan before co-founding CrowdFlower with fellow Powerset alum Chris Van Pelt, said large data volumes can indeed help boost signal, something executives at many companies aren't aware of.

Dmitry Nikolaev/Shutterstock.com

Biewald showed the Rich Data Summit audience–about 300 people at The Summit on Market Street in San Francisco–an example of how the error rate of an algorithm decreased by 50 percent every time the data set doubled. An average algorithm working against a data set with 10,000 data points started out with an error rate of 20 percent, and as the data set grew to 40,000 data points, the error rate dropped to 5 percent.
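The trend Biewald described–error halving with every doubling of the data–amounts to an error rate inversely proportional to data volume. A minimal sketch of that relationship (an illustrative model fit to the numbers above, not a general law of machine learning):

```python
def projected_error(n_points, base_n=10_000, base_error=0.20):
    """Error rate that halves with each doubling of the data set,
    i.e. error inversely proportional to data volume. Illustrative
    model of the trend in Biewald's chart, not a general result."""
    return base_error * base_n / n_points

# Reproduces the figures from the talk: 20% at 10k points, 5% at 40k.
for n in (10_000, 20_000, 40_000):
    print(f"{n:>6} points -> {projected_error(n):.0%} error")
```

In practice such curves flatten out–as Biewald notes later, more training data eventually hits diminishing returns–so the inverse relationship is best read as describing the steep early part of the curve.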

“Having more data can be just as effective as the most powerful algorithm,” Biewald said. “But just saying data science needs more data is an oversimplification. Data science really needs a particular kind of data.”

That particular kind of data is, of course, rich data (hence the name of the one-day conference). Biewald then quoted Silver on the topic. "It's not just big data," Silver wrote in a February blog post titled "Rich Data, Poor Data." "It's something much better. Rich data. By rich data, I mean that it's accurate, concise and subject to rigorous quality control."

"This is why we started CrowdFlower," Biewald continued, "because I wanted rich data and our goal is to get you rich data. And this is why we have this conference, because we want to make rich data available to everyone so we can make amazing algorithms and amazing analysis."

Biewald then showed the crowd another chart similar to the first, only this time the error rate dropped as rich data, rather than simply more data, was added to the equation. "This is eerily similar to doubling the amount of the data," Biewald said. "Just cleaning up the data is as valuable as collecting twice as much data, four times as much data."

Human-Machine Learning

Willyam Bradberry/Shutterstock.com

CrowdFlower, if you're not familiar, provides a service that leverages the power of crowdsourcing to help data science initiatives by supplying them with clean, human-curated data they can use to train machine learning algorithms. At the show, CrowdFlower launched a new solution called CrowdFlower AI that moves the company beyond simply supplying human eyes to classify data, toward training the machines more accurately.

Here's how CrowdFlower AI works in a nutshell. The customer selects machine learning algorithms from Google, IBM Watson, or MetaMind to use on a given set of data. When the algorithms start doing a lousy job on a particular piece of data, they'll automatically kick it back to CrowdFlower to get human eyes on the problem.
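In general terms, that kick-back step is a confidence-thresholding, human-in-the-loop pattern: accept the machine's label when it is confident, and route the item to human annotators when it isn't. A minimal sketch of the idea, with hypothetical function names and a made-up cutoff (not CrowdFlower's actual API or threshold):

```python
CONFIDENCE_THRESHOLD = 0.80  # illustrative cutoff, chosen for this example

def route_prediction(label, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Accept the model's label when it is confident enough;
    otherwise flag the item for human review.  Hypothetical names,
    sketching the general human-in-the-loop pattern."""
    if confidence >= threshold:
        return ("machine", label)
    return ("human_review", label)

# A confident prediction is accepted; a shaky one goes to the crowd.
print(route_prediction("cat", 0.95))  # ('machine', 'cat')
print(route_prediction("cat", 0.40))  # ('human_review', 'cat')
```

The human-corrected labels can then be fed back in as fresh training data, which is what makes the loop more than a simple triage step.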

“This is the right design pattern for doing machine learning,” Biewald said. “One thing that still holds data science and especially machine learning back… is that last 4 percent error rate. It’s hard for me to clean up my data to get 100 percent accuracy. At some point the best algorithms have diminishing returns and getting more training data has diminishing returns.”

Silver agrees that the combination of human sensibility and analytic machines is ideal. In his talk and in his book, he analyzes Garry Kasparov's epic chess battle against IBM's Deep Blue. The supercomputer ultimately won–although Silver seems to think Kasparov might have prevailed had he not been so spooked by some poor and bizarre moves Deep Blue made early on, which Kasparov misinterpreted as the work of a devious genius.

The ideal route, Silver said, is to start out a problem with a probabilistic Bayesian approach, and then enhance it with human intuition as needed. “There’s lots of evidence in fields like chess and weather forecasting… [that] humans can add value relative to the best algorithms in the world,” Silver says. “…[Y]ou’ve got to be as disciplined and algorithmic as possible for the first 80 percent, and then at the end, the last 20 percent is when you use your common sense instead.”

Related Items:

Conference Takes Aim at Data Enrichment Challenges

Training Day: CrowdFlower Sets Human-Generated Data Free

Nate Silver Warns Against Big Data Assumptions

(feature art: lightspring/Shutterstock.com)
