March 14, 2013

Data Athletes and Performance Enhancing Algorithms

Isaac Lopez

Machine learning and big data are combining to reinvent what we know about science and the nature of computing itself.

Recently, Jeremy Howard, president and chief scientist at Kaggle, discussed the impact that machine learning is having in the field, where big data and a phenomenon called “deep learning” are combining for some groundbreaking results.

Howard’s company, Kaggle, facilitates machine learning competitions, connecting companies (and their data) with programmers who have a passion for this type of competitive coding. Like the Tour de France for data scientists, the competitions are rigorous marathons of coding, running for 90 days at a time, with entrants submitting results once per day. Kaggle tracks progress day to day and provides a leaderboard showing which contestant is closest to the target.

One example of this style of competition was the Netflix Prize, which Netflix launched in 2006 and awarded in 2009. Pursuing a prize purse of $1 million, coders were challenged to come up with a more accurate movie recommendation system for users of the Netflix service. Over 50,000 data athletes in 186 countries were given a partial data set for a Netflix account, and were asked to put together an algorithm to predict what was missing. The winning team, BellKor’s Pragmatic Chaos, beat the Netflix system by over 10%.

“That was a very successful competition, and it led to all kinds of interesting mathematical and applied math breakthroughs,” says Howard. Not long after that competition, Kaggle was founded, and Howard is no stranger to seeing predictive modeling breakthroughs as the competitions have progressed.

In a recent talk, loftily titled “Deep Learning: The Biggest Data Science Breakthrough of the Decade,” Howard discussed one such breakthrough that happened during a chemoinformatics competition sponsored by the pharmaceutical company Merck. Without getting too far into the weeds of the competition itself, the goal was ultimately to optimize drug discovery by applying predictive analytics.

The breakthrough came from a student team led by Dr. Geoffrey Hinton, who was using a branch of machine learning dubbed “deep learning.” Joining the competition as it was nearing its end and armed with deep learning, Dr. Hinton’s team vaulted to the top of the leaderboard in unprecedented bounds, and in the end ran away with the competition.

“Chemoinformatics is something that has received thousands of man years of research,” explains Howard in describing the significance of the feat. “A lot of very smart people have worked on it, and yet, using deep learning, a better algorithm was developed than has ever been developed before.”

Not bad for a piece of software that had no specific knowledge about chemoinformatics, and relied only on the provided datasets and a team of students to retrofit it.

“What they’ve done is use a general purpose algorithm, which in the past has been used effectively in things like speech recognition and object detection, and they actually turned it into a general machine learning tool which can do its own feature engineering for arbitrary new problems,” marvels Howard.

The implications are broad, says Howard, who notes that traditionally, the hardest part of machine learning competitions is building sophisticated machine learning models that attempt to use past data to predict the future. “That bit was done nearly automatically using deep learning,” comments Howard.

With breakthroughs like this taking place as machine learning and big data integrate to model better outcomes for the future, Howard reflects on trends that he is witnessing in the computing landscape.

The first trend, says Howard, is the move away from expertise and towards the data. While this may sound counterintuitive on some levels, Howard explains that of the hundreds of machine learning competitions Kaggle has run, in nearly 100% of cases the winner was someone who didn’t have domain-specific knowledge.

“Really, there have been so many examples in the last 20 years of industry assumptions that have claimed to be expertise that turned out to be wrong, and they get replaced by actual data driven decision-making,” commented Howard.

This has some pretty broad implications, says Howard, citing the “deep learning” algorithm. “The whole field of chemoinformatics has been pushed to another level, not by people who studied molecular binding in depth, and not by people who had spent a lot of time fine tuning an algorithm for the purposes of using it in chemoinformatics, but by a general purpose algorithm.”

The second trend, says Howard, is the move away from simple data collection and storage, towards actually using that data effectively by applying appropriate optimizing algorithms to it. It’s true, says Howard, that technology vendors get the bulk of their revenues from data storage, data querying, and integration services, but for companies to get value out of that data, he notes, they need to actually use these algorithms to make the most of it. “We’re definitely seeing, even in the past few months, an increasing intensity on developing appropriate algorithms.”

The final, and perhaps most profound, trend is that the algorithms being developed are shifting away from man-made algorithms to machine-learnt algorithms. Howard explains that in the past, expert systems were the trend. These are systems, says Howard, “where lots of experts were interviewed and asked how they made decisions, and then there was an attempt to write computer programs that reflected that knowledge.” According to Howard, a pronounced shift is happening away from this model.

“Increasingly, we’re finding with machine learning, it gives us more accurate models more quickly because rather than being built on top of theory, they’re being built on top of actual empirical data,” comments Howard. “We’re definitely seeing machine learning increasingly replacing man made algorithms.”

Applications: Artificial Intelligence, Predictive Analytics

Technologies: Frameworks, Systems

Sectors: Academia, Biosciences, Healthcare, Other, Science

Tags: BellKor’s Pragmatic Chaos, chemoinformatics, data driven decision-making, deep learning, Geoffrey Hinton, Jeremy Howard, kaggle, machine learning, Merck, Netflix, predictive analytics
