Big Data Dispelling Preconceived Notions in the NFL
One of the most commonly cited examples in discussions of what “big data” actually is, is the now-famous Moneyball case study, in which Major League Baseball’s Oakland Athletics used analytics to break the game down in ways that hadn’t been done before. With the NFL season bearing down on us, we ask the question: “What about the NFL?”
In an OSCON session last month, one data scientist discussed a project he started to examine NFL data and find some of the correlations hidden in it. We caught up with Jesse Anderson, a curriculum developer and instructor at Cloudera, to discuss his NFL stats project and to determine how much cold cash we could take to Vegas based on his data.
Unfortunately, the answer is: not much. The project, at this point, isn’t predictive, explained Anderson, and getting there would take a considerable amount of work and refinement. For this project, Anderson wanted to understand how the game has been played from a statistical point of view over the last 10 years, and whether things like severe weather and arrests affected the outcomes.
Using data collected by the website Advanced NFL Stats, Anderson put ten years of NFL play-by-play data into Hadoop to try to extract useful information from the unstructured data. “I spent a good 80% of my time dealing with problems in the data,” he explained, discussing the challenges of working with an unstructured data set that contains 2,898 games with 471,392 plays. The biggest challenge, he explained, was the natural language processing needed to get useful data out consistently. He says he used regular expressions to parse the human-generated play descriptions and extract useful information.
This was a painstaking process, because while the data was predictable in many cases, large amounts of it needed special attention before it could be effectively wrangled. With almost 500,000 plays to deal with, the data munging effort was considerable.
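To make the regular-expression approach concrete, here is a minimal Python sketch of how a free-text play description might be turned into structured fields. The article does not show the wording of the actual feed or Anderson's own expressions, so the sample descriptions, the patterns, and the `parse_play` helper are all hypothetical; a real data set would likely need many more special cases, which is where the 80% goes.

```python
import re

# Hypothetical patterns: the article does not show the real play-by-play
# wording, so these only illustrate the general technique.
PLAY_PATTERNS = [
    ("kickoff", re.compile(r"\bkicks\b", re.IGNORECASE)),
    ("punt", re.compile(r"\bpunts\b", re.IGNORECASE)),
    ("field goal", re.compile(r"\bfield goal\b", re.IGNORECASE)),
    ("pass", re.compile(r"\bpass\b", re.IGNORECASE)),
    ("run", re.compile(
        r"\b(?:left|right) (?:end|tackle|guard)\b|\bup the middle\b|\brush(?:es|ed)?\b",
        re.IGNORECASE)),
]

# Matches "for 7 yards", "for -3 yards", or "for no gain".
YARDS_RE = re.compile(r"\bfor (?:(?P<yards>-?\d+) yards?|(?P<no_gain>no gain))\b")


def parse_play(description):
    """Classify a play description and pull out the yardage, if any."""
    play_type = "other"
    # The first matching pattern wins, so "pass short left" counts as a pass.
    for name, pattern in PLAY_PATTERNS:
        if pattern.search(description):
            play_type = name
            break

    match = YARDS_RE.search(description)
    if match is None:
        yards = None          # e.g. an incomplete pass
    elif match.group("no_gain"):
        yards = 0
    else:
        yards = int(match.group("yards"))
    return {"type": play_type, "yards": yards}
```

Calling `parse_play("(12:34) T.Brady pass short left to W.Welker for 7 yards")` on this made-up description yields a pass play with 7 yards. Descriptions that match none of the patterns fall into the "other" bucket, which is a simple way to find the strings that still need special attention.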
While the play-by-play was a challenge, there was other data that he wanted to add to the system, each source with its own challenges. Weather data was collected and entered into the system, covering such things as precipitation level and type, average wind speed, and maximum and minimum temperatures. Information on the venues was collected from Wikipedia, adding things such as stadium capacity, playing surface, roof type, and elevation. Additionally, he included arrest records kept by the San Diego Union-Tribune.
Once the data was in place, Anderson used Hive and Impala to query it and see what he could find out about the way the game is played, and what conditions might lead to wins or losses. The theme of his research, he says, was that the data often flies in the face of long-established preconceived notions.
Debunking the Mile High Altitude Myth and More…
Anyone who watches the NFL has seen the images of players on the sidelines huffing oxygen through masks, while the announcers dramatize the images with talk about the advantage that the Denver Broncos have in their mile-high home field. According to the data, the altitude doesn’t show any discernible effect on either the outcome or how the game is played relative to other stadiums, save for one minor difference: a 1% increase in passes. “They all pretty much played the game the same regardless of what elevation the game was being played at,” said Anderson, who says that perhaps the altitude used to make a difference (this data covers a 10-year span), but he believes that modern conditioning regimens negate any advantage that might otherwise have been gained, if there was any to be had at all.
However, that doesn’t mean that there aren’t real home field advantages to speak of. The home team wins an average of 57% of the time. There are outliers to this number, however. Baltimore was the biggest outlier in the data when playing at home in adverse weather, winning by an average score of 22-14 in those conditions. This makes some visceral sense given the strength of the team’s defense during this period, and considering that offenses would have to battle against both it and the weather. But some interesting context is added when you consider that, according to the data, another defense-oriented team, the Pittsburgh Steelers, had the greatest home field advantage when weather wasn’t a concern. The worst home field advantage belongs to the Sunshine State’s Miami Dolphins, who, according to the data, lost at home by an average score of 14-18 over the last ten years.
The data revealed some interesting things about the way the game is played. On first down, 52% of plays are runs and 42% are passes. On second down, it’s 45% run and 49% pass. And on third down, this changes dramatically, with runs falling to 26% and passes climbing to 66%. However, the thing that changed the way the game was played the most is the wind. In calm winds, 41% of plays were passes and 37% were runs. But when the wind climbed higher than 30 MPH, this virtually flips, with 34% of plays being passes and 46% being runs.
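Breakdowns like these come down to grouping plays by some attribute and computing each play type's share within the group. Anderson ran his queries in Hive and Impala; the Python sketch below is only a stand-in for that idea, and the `play_mix` function, the record fields, and the toy records are all hypothetical. The percentages quoted above do not add up to 100, presumably because other play types (punts, kicks, and so on) make up the rest, so the sketch keeps an explicit bucket for every type it sees.

```python
from collections import Counter, defaultdict


def play_mix(plays, key):
    """Return {group: {play_type: percent}} for plays grouped by key(play)."""
    counts = defaultdict(Counter)
    for play in plays:
        counts[key(play)][play["type"]] += 1

    mix = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        # Share of each play type within the group, as a whole-number percent.
        mix[group] = {
            play_type: round(100 * n / total)
            for play_type, n in counter.items()
        }
    return mix


# Toy records for illustration only; these are not values from the data set.
toy_plays = [
    {"down": 1, "type": "run"},
    {"down": 1, "type": "run"},
    {"down": 1, "type": "pass"},
    {"down": 1, "type": "punt"},
    {"down": 3, "type": "pass"},
]

by_down = play_mix(toy_plays, key=lambda play: play["down"])
```

Because `key` is an arbitrary function, the same helper could group by a wind-speed bucket instead of by down, for example one that labels each play "calm" or "over 30 MPH" based on the game's average wind speed.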
Where arrests were concerned, they didn’t have the impact that one might expect: when a player on the home team, the away team, or both had been arrested, the home team still won 57% of the time, the same rate as the overall home-field average. While that’s a strange result, the next stat shows a startling trend in arrests in the NFL. In 2002, the share of teams with an arrest on their roster was at a low of 56%. In 2012, that number had climbed to an alarming 91%.
While there is a lot to be gleaned from the data, Anderson doubts that we’ll ever see the Moneyball scenario in the NFL. “When I look at all this data and watch the game, all 11 men on the field really do have to work together and have a part to play in the game, where in baseball, the stats are largely isolated between the pitcher vs. the batter,” he explains.
That doesn’t mean that data doesn’t have a role in shaping the future of the sport, says Anderson. “I think that data could improve some of the outcomes where you could merge the traditional scouting reports with the data to improve outcomes.”
That’s a data munging project for another day. In the meantime, I have a preconceived notion of my own that if you added a data set to correlate my fantasy roster to NFL team wins and losses during that time, the data would unfavorably implicate me. Sigh…
Enjoy the season!