October 24, 2014

Today’s Baseball Analytics Make Moneyball Look Like Child’s Play

Baseball has always been a game of numbers and statistics. But thanks to an explosion of data over the past seven years and the advent of new analytic software running on supercomputers, the game is on the cusp of changes that will make Moneyball look like it belongs in the minor leagues.

When the San Francisco Giants take the field against the Kansas City Royals in game three of the World Series tonight, you can bet that the choices made by managers and coaches will have a big impact on the outcome of the game. From choosing the lineup and starting pitchers to selecting pinch hitters and relief pitchers, the in-game decisions that the Giants’ Bruce Bochy and the Royals’ Ned Yost make will be instrumental to winning.

Of course, that’s true for all 2,000-odd games that Major League Baseball holds in a given season. In fact, managers have played a critical role over the course of all 180,000 games played across the entire 140-year history of professional baseball. But what’s different today is the amount and quality of actionable information available to managers. Data is now being generated in a continuous stream, and the best teams are finding ways to exploit it to their advantage.

“What we’ve seen is an absolute explosion of baseball data,” said Vince Gennaro, an MLB consultant and president of the Society for American Baseball Research (SABR), the organization that pioneered the “sabermetrics” approach to empirical analysis that Oakland A’s general manager Billy Beane implemented, and which Michael Lewis made famous in his 2003 bestselling book “Moneyball.”

The advent of Moneyball was clearly a breakthrough for general managers faced with making personnel decisions, Gennaro said last month during Tabor Communications’ Enterprise HPC event in Carlsbad, California, where he presented a keynote. “The irony is it preceded this data explosion. You can take the first 135 years of baseball data–the game accounts and what happened during games–and you can put that on a 2GB flash drive,” he said.

With the new data that MLB is capturing from high-speed video and Doppler radar, each game is on the cusp of generating 1TB of data. “So we’re talking about a 10-million fold increase in data capture,” Gennaro says. “We’re getting tracking of everything that goes on the field during the entire baseball game.”

Baseball’s Big Data Era

The data explosion began in 2007 with the introduction of the Pitchf/x system, which tracks more than 20 pieces of data for each pitch, including the velocity of the pitch, the three-dimensional movement of the ball, and the pitcher’s arm angle. Then the Hitf/x system added another five measures for each batted ball in play. This year, MLB installed the so-called Fieldf/x system to track the movements of fielders and baserunners. It was used in just three ballparks but will be expanded to all ballparks next year.

Suddenly, MLB teams have access to a treasure trove of data about each play and player. “Not only do we have detailed outcome data about what happened, but we have process data now–what’s happening during the at-bat, what’s the interaction between the pitcher and batter as the at-bat is taking place–not just the fact that the result was a single to left field,” Gennaro said.


“In much the same way that the government uses complex graphs to search for relationships in disparate data sets, we’re now doing the same in baseball,” says Vince Gennaro, an MLB consultant and president of the Society for American Baseball Research (SABR).

Gennaro, who honed his analytic skills during a 20-year career at PepsiCo and is the director of the Graduate Sports Management program at Columbia University, is at the cutting edge of the application of advanced analytics in baseball. While there are many directions one could go with all this data, the one that Gennaro has focused on is the pitcher-batter matchup.

What struck Gennaro was the utter lack of good data about pitcher-batter matchups. “The problem is for the most part people are being very literal about it,” he said. “‘Well, this batter is 1 for 6 against this pitcher or he’s 0 for 4.’ I can assure that 1 for 6 and 0 for 4 are not statistically significant. They don’t mean anything.”

And even if the sample size is a bit bigger, and a given hitter has faced a given pitcher perhaps 40 times over the course of a career, it still doesn’t mean much. “Who really cares what Derek Jeter did against a pitcher back in 2004?” he said. “Is that really relevant today? They’re both very different pitchers and hitters if they’re both still in the game today.”

Graphing Matchups

Instead of the historical approach to predicting the results of hitter-pitcher matchups, Gennaro employed what he describes as his Netflix approach. The movie service is expert at tracking its customers’ viewing habits and making pertinent recommendations about movies. Netflix does this by breaking movies down into constituent parts that can be analyzed and correlated with user preferences. Gennaro did the same thing with pitchers and hitters, using all that data generated by the Pitchf/x and Hitf/x systems.

“Instead of focusing on how one batter performs against one pitcher, I looked at how a batter performs against a pitcher’s attributes,” he said. “I went that extra level deeper and looked at a hitter against all pitchers and analyzed the pitcher attributes that he’s successful against or what he struggles against, from velocity to pitch movement to repertoire to release point and pitch sequences. In total we had 14 attributes and looked at them in all possible combinations.”
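The attribute-level idea can be illustrated with a minimal Python sketch. This is not Gennaro's software: the records, the two attribute names, and the simple averaging rule are all hypothetical stand-ins for his 14 attributes and his model. It only shows how a batter's results against all pitchers can be rolled up by pitcher attribute and then applied to a pitcher he may never have faced.

```python
from collections import defaultdict

# Hypothetical toy records: (batter, pitcher attributes, reached_base).
# Attribute names and values are illustrative, not the 14 attributes in the article.
PLATE_APPEARANCES = [
    ("batter_a", {"velocity": "high", "arm_slot": "over_top"}, 0),
    ("batter_a", {"velocity": "high", "arm_slot": "sidearm"}, 0),
    ("batter_a", {"velocity": "low", "arm_slot": "over_top"}, 1),
    ("batter_a", {"velocity": "low", "arm_slot": "sidearm"}, 1),
    ("batter_a", {"velocity": "low", "arm_slot": "over_top"}, 0),
]


def attribute_rates(records, batter):
    """Return {(attribute, value): success rate} for one batter across all pitchers."""
    totals = defaultdict(lambda: [0, 0])  # (attribute, value) -> [successes, appearances]
    for name, attrs, reached_base in records:
        if name != batter:
            continue
        for key_value in attrs.items():
            totals[key_value][0] += reached_base
            totals[key_value][1] += 1
    return {kv: s / n for kv, (s, n) in totals.items()}


def matchup_score(rates, pitcher_attrs):
    """Average the batter's rates over a pitcher's attributes (assumed scoring rule)."""
    known = [rates[kv] for kv in pitcher_attrs.items() if kv in rates]
    return sum(known) / len(known) if known else None


rates = attribute_rates(PLATE_APPEARANCES, "batter_a")
score = matchup_score(rates, {"velocity": "low", "arm_slot": "sidearm"})
print(score)
```

Because the score is built from attributes rather than head-to-head history, it does not depend on the tiny samples (1 for 6, 0 for 4) that Gennaro dismisses as statistically meaningless.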

This graph shows the clusters of right-handed pitchers versus left-handed batters.

The software that Gennaro wrote implements a model based on five elements (pitching style, pitcher quality, hitting style, hitter quality, and ballpark) that can be scaled up and down in relative importance. The model, which runs on a graph database hosted on a high-end Urika appliance from YarcData, a subsidiary of supercomputer maker Cray, is loaded with real-world baseball data collected over the past 18 months: close to 1 million pitches and a quarter million or so balls put into play (older data is less relevant, he says).

The idea is to give managers deeper insight into how to shape the batter-pitcher matchups to their advantage. The model rates each batter, in percentiles, against each pitcher, based on the batter’s qualities and the pitcher’s qualities, normalized for the ballpark.
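As a rough illustration of a percentile rating, the Python sketch below ranks a pool of batters against a single pitcher. The batter names and raw scores are hypothetical, the percentile definition (share of the pool scoring strictly lower) is an assumption, and ballpark normalization is left out because the article does not say how the model performs it.

```python
def percentile_ranks(scores):
    """Map each batter to the percentage of the pool with a strictly lower score."""
    values = list(scores.values())
    pool_size = len(values)
    return {
        batter: 100.0 * sum(other < score for other in values) / pool_size
        for batter, score in scores.items()
    }


# Hypothetical raw matchup scores for four batters against one pitcher.
raw_scores = {"batter_a": 0.30, "batter_b": 0.35, "batter_c": 0.25, "batter_d": 0.40}
ranks = percentile_ranks(raw_scores)
best = max(ranks, key=ranks.get)
print(ranks, best)
```

A manager choosing between two candidates for a lineup spot would then compare their percentiles against that day's starter rather than their raw career numbers.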

So if the New York Yankees’ Ichiro Suzuki displays a particular aptitude for hitting crafty left-handers instead of power righties, it can tell Yankees manager Joe Girardi that he is probably better off going with Brennan Boesch against the Colorado Rockies’ starter Juan Nicasio at Coors Field. “What I want to do is give a manager an unvarnished perspective on how, all things being equal, these two batter-pitchers should perform against each other in the setting of Colorado,” Gennaro said.

Show Me the Data

So, does it work? You bet it does. “What we found is the projections are remarkably accurate–frankly more so than I anticipated,” the “Diamond Dollars” author said. At Colorado, Girardi went with Suzuki, who went 0 for 4 against Nicasio. Boesch was brought in as a pinch hitter and got a hit that won the game (which doesn’t mean anything, according to Gennaro, who admits to his bias).


Gennaro worked with graphs loaded onto the Urika appliance from YarcData, a Cray subsidiary.

Gennaro said that if his approach were fully implemented (meaning it is used for setting the starting lineup, for picking pinch hitters, and for picking relief pitchers), it would have a net impact of 33 runs over the course of the season for a contending team. Gennaro, who is obviously a lover of numbers and statistics, took it a bit further. If every 10 runs corresponds to one win, he figured, then this approach could result in about three wins for a given team. What’s more, if each win costs an average of $5 million (for a contending team), then the potential value of this approach is $15 million.
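The arithmetic behind that estimate is simple enough to check in a few lines of Python. The three inputs come from the article; the function name and the choice to round down to whole wins are assumptions made to match the "three wins" figure.

```python
RUNS_GAINED = 33          # net runs per season cited by Gennaro
RUNS_PER_WIN = 10         # rule of thumb: every 10 runs corresponds to one win
COST_PER_WIN = 5_000_000  # average dollars per win for a contending team


def season_value(runs, runs_per_win, cost_per_win):
    """Convert a run gain into whole wins and the dollar value of those wins."""
    wins = runs // runs_per_win  # whole wins only (assumed rounding)
    return wins, wins * cost_per_win


wins, value = season_value(RUNS_GAINED, RUNS_PER_WIN, COST_PER_WIN)
print(wins, value)  # 3 15000000
```

Without the rounding, 33 runs would work out to 3.3 wins, or $16.5 million at the same price per win.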

While he wasn’t at liberty to discuss which MLB teams have purchased his services, it’s clear that teams are increasingly open to what analytics can provide them. “I can tell you there are some organizations that are building analytics cultures from the bottom up,” he said. (For more insight into Cray’s relationship with MLB and how it’s potentially using the YarcData appliance, read HPCwire editor Nicole Hemsoth’s April story “Inside Major League Baseball’s ‘Hypothesis Machine.’”)

Still, there is some hesitation among MLB teams, which are notoriously superstitious and resistant to change. But they’re not the only ones with an interest in what happens on the field. People who bet on games, for instance, have expressed a keen interest. “I’ve gotten a lot more calls from bettors than I have from teams about this,” Gennaro said. “They’re early adopters.”

Related Items:

Playing Ball with MLB’s New Analytic Data Feed

Come On, Blue! Data Reveals Umpires’ Biases

Athletes Find an Edge with Performance-Enhancing Data