March 4, 2015

Training Day: CrowdFlower Sets Human-Generated Data Free

Data scientists who are looking for high-quality sets of curated data on which to train their machine learning models may want to check out CrowdFlower, which today unleashed a veritable treasure trove of free human-generated data.

CrowdFlower today released about 40 data sets as part of its Data for Everyone campaign (see http://www.crowdflower.com/data-for-everyone). But over the coming weeks, the San Francisco company expects to make thousands of data sets available for download from its website, covering millions of records.

“One of the things I really love about CrowdFlower is it makes it really easy for folks to get really high quality data,” says Lukas Biewald, co-founder and CEO of CrowdFlower. “A lot of innovations get slowed down because there aren’t any data sets available. I’m just excited to put out a lot of beautiful data sets for people.”

Biewald is a self-described data guy who deeply believes in the power of data quality. At a previous startup, Biewald recalls how hard the company worked to get the data that would enable it to accurately test the output of the search engine they were building. “It took us months and months before we were able to test our algorithms,” he tells Datanami. “It actually takes time to collect data as human beings.”

He founded CrowdFlower in 2010 to help streamline that data collection and curation process. Nearly a thousand companies, including eBay, Autodesk, and Unilever, have tasked CrowdFlower’s colony of crowd-sourced workers to painstakingly go through human-generated data, clean it up, and make it available for data scientists to build recommendation systems, image identification systems, and other data-driven apps.

The data sets CrowdFlower works on are highly specific and very detailed. For example, one of the data sets it’s making available through its Data for Everyone program is a collection of 4,000 tweets about Coachella 2015, the multi-day music festival held every spring in the California desert. CrowdFlower workers have gone through the Twitter posts and scored each one with a positive, neutral, or negative sentiment.

The combination of that scoring and the actual words in the tweet (also contained in the data set) is very useful. “Say you’re building an automated sentiment analysis system,” Biewald says. “You can look at the output of your sentiment system on all these tweets about Coachella and compare them to our gold-standard, human-labeled data set. So if the thing that your system is outputting is the same as our workforce says, that’s great. And if it’s different, then you might worry that your system is having issues.”
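The comparison Biewald describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `agreement_report` and the toy labels below are hypothetical, not part of CrowdFlower’s platform or its Coachella data set, and a real evaluation would load the human-assigned labels from the downloaded file.

```python
from collections import Counter


def agreement_report(gold, predicted):
    """Compare a system's sentiment labels with human-assigned gold labels.

    Returns the share of items where the two agree, plus a count of each
    (gold, predicted) pair where they disagree.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one labeled example")

    pairs = list(zip(gold, predicted))
    matches = sum(1 for g, p in pairs if g == p)
    disagreements = Counter((g, p) for g, p in pairs if g != p)
    return {"accuracy": matches / len(pairs), "disagreements": disagreements}


# Hypothetical toy data standing in for human labels and system output.
gold = ["positive", "negative", "neutral", "positive"]
predicted = ["positive", "neutral", "neutral", "negative"]

report = agreement_report(gold, predicted)
print(f"Agreement with human labels: {report['accuracy']:.0%}")
for (g, p), count in report["disagreements"].items():
    print(f"human said {g}, system said {p}: {count}")
```

A low agreement rate, or disagreements that cluster on one pair of labels, is the kind of signal that would suggest the system “is having issues.”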

Another CrowdFlower customer paid somebody to look at every cover of Time magazine since 1923 and categorize the gender of the person on the cover. That data set, which is also easily accessible from the Data for Everyone webpage, could be useful for people building computer vision systems. “Say Facebook wanted to build a system that automatically labels the gender of images of people uploaded,” Biewald says. “That might be really good training data to build that system.”

Because of the time and expense it takes to curate a high-quality set of human-generated data, such data sets are not easy to come by. One of the only publicly available ones came from Netflix. Several years ago, Netflix released a small portion of its customer data, which was basically about pairing movies: the “if you liked movie A, then you’d like movie B” kind of thing.

“Basically every person I know who works on recommendation system tested on that Netflix data because it’s the only public available data set,” Biewald says. “At every company I’ve worked at, we’ve been really looking for human labeled data set. There are a few that are publicly available.”

CrowdFlower CEO Lukas Biewald

Before releasing Data for Everyone, CrowdFlower consulted a few of its customers about their willingness to share their data, and ended up adding a button on the CrowdFlower platform that would let them do this. “We got [a] surprising amount of buy-in,” Biewald says, “and we started to release what I think is the biggest curated data set of human-labeled data. There are maybe a million records here [in Data for Everyone]. It would take you years to collect even just the data we’ve [released] so far by hand.”

The big data revolution has changed how we think about data. Hundreds of years ago, it was reasonable to assume that one could collect all of the printed books in a large library. Today, data is generated so fast that we barely have time to think about it, so off it goes into Hadoop, to make sense of later.

As data gets cheaper and cheaper to store, folks start storing more and more stuff. “But when they actually want to do an analysis or link it up to a BI tool, they realize they have a mess on their hands,” says Biewald, who oversaw a $12.5 million series C funding round for CrowdFlower last September, bringing the company’s total funding to $28 million.

“I believe that every data scientist should be using CrowdFlower because every data analysis works better if the data is cleaner or more enriched,” he says. “It’s better to do a bad analysis on good data than a great analysis on bad data.”
