March 14, 2013

Data Athletes and Performance Enhancing Algorithms

Isaac Lopez

Machine learning and big data are combining to reinvent what we know about science and the nature of computing itself.

Recently, Jeremy Howard, president and chief scientist at Kaggle, discussed the impact that machine learning is having in the field, where big data and a phenomenon called “deep learning” are combining for some groundbreaking results.

Howard’s company, Kaggle, facilitates machine learning competitions, connecting companies (and their data) with programmers who have a passion for this type of competitive coding. Like the Tour de France for data scientists, the competitions are rigorous marathons of coding, running for 90 days at a time, with entrants submitting results once per day. Kaggle tracks progress day to day, and provides a leaderboard showing which contestant is closest to the target.

One example of this style of competition was the Netflix Prize, which Netflix awarded in 2009. Pursuing a prize purse of $1 million, coders were challenged to come up with a more accurate movie recommendation system for users of the Netflix service. Over 50,000 data athletes in 186 countries were given a partial data set of Netflix users’ movie ratings, and were asked to put together an algorithm to predict what was missing. The winning team, BellKor’s Pragmatic Chaos, beat the accuracy of Netflix’s own system by over 10%.

“That was a very successful competition, and it led to all kinds of interesting mathematical and applied math breakthroughs,” says Howard. Not long after that competition, Kaggle was founded, and Howard is no stranger to seeing predictive modeling breakthroughs as the competitions have progressed.

In a recent talk, loftily titled “Deep Learning: The Biggest Data Science Breakthrough of the Decade,” Howard discussed one such breakthrough that happened during a chemoinformatics competition sponsored by the pharmaceutical company Merck. Without getting too far into the weeds of the competition itself, its goal was ultimately to optimize drug discovery by applying predictive analytics.

The breakthrough came from a student team led by Dr. Geoffrey Hinton that used a branch of machine learning dubbed “deep learning.” The team joined the competition as it was nearing its end and, armed with deep learning, vaulted to the top of the leaderboard in unprecedented bounds, ultimately running away with the competition.

“Chemoinformatics is something that has received thousands of man-years of research,” explains Howard in describing the significance of the feat. “A lot of very smart people have worked on it, and yet, using deep learning, a better algorithm was developed than has ever been developed before.”

Not bad for a piece of software that had no specific knowledge about chemoinformatics, and relied only on the provided datasets and a team of students to retrofit it.

“What they’ve done is use a general purpose algorithm, which in the past has been used effectively in things like speech recognition and object detection, and they actually turned it into a general machine learning tool which can do its own feature engineering for arbitrary new problems,” marvels Howard.

The implications are broad, says Howard, who notes that traditionally, the hardest part of machine learning competitions is building sophisticated machine learning models that attempt to use past data to predict the future. “That bit was done nearly automatically using deep learning,” comments Howard.

With breakthroughs like this taking place as machine learning and big data integrate to model better outcomes for the future, Howard reflects on trends that he is witnessing in the computing landscape.

The first trend, says Howard, is the move away from expertise and towards the data. While this may sound counterintuitive on some levels, Howard explains that in nearly 100% of the hundreds of machine learning competitions Kaggle has run, the winner was someone who didn’t have domain-specific knowledge.

“Really, there have been so many examples in the last 20 years of industry assumptions that have claimed to be expertise that turned out to be wrong, and they get replaced by actual data driven decision-making,” comments Howard.

This has some pretty broad implications, says Howard, citing the “deep learning” algorithm. “The whole field of chemoinformatics has been pushed to another level, not by people who studied molecular binding in depth, and not by people who had spent a lot of time fine tuning an algorithm for the purposes of using it in chemoinformatics, but by a general purpose algorithm.”

The second trend, says Howard, is the move away from simple data collection and storage, towards actually using that data effectively by applying appropriate optimizing algorithms to it. It’s true, says Howard, that most technology vendors get the bulk of their revenue from data storage, data querying, and integration services, but for companies to get value out of that data, notes Howard, they need to actually use these algorithms to make the most of it. “We’re definitely seeing, even in the past few months, an increasing intensity on developing appropriate algorithms.”

The final, and perhaps most profound, trend is that the algorithms being developed are shifting away from man-made algorithms towards machine-learned ones. Howard explains that in the past, expert systems were the trend. These are systems, says Howard, “where lots of experts were interviewed and asked how they made decisions, and then there was an attempt to write computer programs that reflected that knowledge.” According to Howard, a pronounced shift is happening away from this model.

“Increasingly, we’re finding with machine learning, it gives us more accurate models more quickly because rather than being built on top of theory, they’re being built on top of actual empirical data,” comments Howard. “We’re definitely seeing machine learning increasingly replacing man made algorithms.”
