MLB Hits a Home Run with BigQuery Migration
The Padres’ Fernando Tatis Jr. may be delighting baseball fans with his energetic style of play during this COVID-shortened season. But there’s plenty of excitement in the back office too, now that Major League Baseball has completed its migration to Google BigQuery.
MLB previously used an enterprise data warehousing system from Teradata to analyze a variety of data relevant to the league, according to Robert Goretsky, MLB’s vice president of data engineering.
About 1,000 active database tables were continuously updated by more than 350 data pipelines maintained in Airflow and Informatica, Goretsky wrote in a post on the Google Cloud blog. The data originated from third-party and internal sources, including revenue, ticket sales, and fan engagement data, among others.
This data was accessed by a range of users across the MLB organization and the 30 individual ballclubs, including product, marketing, finance, ticketing, shop, analytics, and data science departments, he writes. The primary BI clients were Looker (now part of Google Cloud) and Business Objects, which is owned by SAP.
In 2018, MLB decided to explore a data warehouse migration. After a successful proof of concept on Google BigQuery in early 2019, the decision was made in May 2019 to migrate off Teradata. The organization managed to complete the migration in just seven months, Goretsky writes.
According to Goretsky, the baseball organization has realized “numerous benefits” as a result of the migration. In addition to a 50% speed boost on query completion times, the organization is able to run bigger queries without worry.
“In many cases, queries that would simply time out or fail on Teradata (and impact the entire system in the process), or that were not feasible to even consider loading into Teradata, run without issue on BigQuery,” he writes.
Google’s pricing model for BigQuery is also paying dividends for MLB, according to Goretsky. “As MLB underwent the migration effort, BigQuery cost increased linearly with the number of workloads migrated,” he writes. “By switching from on-demand to flat-rate pricing using BigQuery Reservations, we are able to fix our costs and avoid surprise overages (there’s always that one user who accidentally runs a ‘SELECT * FROM’ the largest table), and share unused capacity with other departments in our organization, including our data science and analytics teams.”
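The trade-off Goretsky describes comes down to simple arithmetic. The snippet below is a minimal sketch, not MLB's actual billing: the per-terabyte rate, the monthly reservation fee, and the scan volumes are hypothetical placeholders, chosen only to show that on-demand cost grows linearly with the data scanned while a flat-rate reservation stays fixed.

```python
# Illustrative comparison of on-demand vs. flat-rate warehouse pricing.
# All prices and volumes are hypothetical placeholders, not real rates.

PRICE_PER_TB = 5.0      # hypothetical on-demand price per terabyte scanned
MONTHLY_FEE = 10_000.0  # hypothetical fixed monthly reservation fee


def on_demand_cost(tb_scanned: float, price_per_tb: float) -> float:
    """On-demand cost scales linearly with the terabytes scanned."""
    return tb_scanned * price_per_tb


def break_even_tb(monthly_fee: float, price_per_tb: float) -> float:
    """Monthly scan volume at which flat-rate and on-demand cost the same."""
    return monthly_fee / price_per_tb


for tb in (500, 2_000, 4_000):
    cost = on_demand_cost(tb, PRICE_PER_TB)
    cheaper = "on-demand" if cost < MONTHLY_FEE else "flat-rate"
    print(f"{tb:>5} TB scanned: on-demand ${cost:,.0f} "
          f"vs flat ${MONTHLY_FEE:,.0f} -> {cheaper}")
```

Under these made-up numbers, the two plans cost the same at 2,000 TB scanned per month. Beyond that point, an accidental full-table scan adds nothing to a flat-rate bill but keeps adding to an on-demand one.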
Giving users direct access to the data warehouse is also easier under BigQuery, he says. It was “cumbersome” to do that under the previous setup, which required synchronizing data from Teradata to AWS S3 buckets, where individual ballclubs could access it.
“BigQuery made it trivial to securely share datasets with any G Suite user or group with the click of a button,” he writes.
The organization was also able to eliminate a database administrator (DBA) role with BigQuery, he says. Developers, data engineers, and data scientists are also happier with the improved documentation, Goretsky says.
The organization now has the time to tackle new projects, such as OneView, which combines data from 30 sources into a single table, with one row per fan. Such a project previously would have required a considerable amount of work to ensure that the tables were rebuilt on a regular basis, but it runs without much drama under BigQuery, according to Goretsky.
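The "one row per fan" idea can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not MLB's implementation: the source names (`ticketing`, `shop`), the `fan_id` key, and the field names are all hypothetical, and a real pipeline would run as queries in the warehouse rather than in application code.

```python
from collections import defaultdict


def build_one_view(sources: dict) -> dict:
    """Merge records from several sources into one row per fan.

    `sources` maps a source name to a list of records, each carrying a
    hypothetical `fan_id` key. Fields are prefixed with the source name
    so that columns from different sources cannot collide.
    """
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            fan_id = record["fan_id"]
            for field, value in record.items():
                if field != "fan_id":
                    merged[fan_id][f"{source_name}_{field}"] = value
    # Attach the key itself so each row is self-describing.
    return {fan_id: {"fan_id": fan_id, **fields}
            for fan_id, fields in merged.items()}


# Two hypothetical sources standing in for the 30 mentioned in the article.
sources = {
    "ticketing": [{"fan_id": 1, "games_attended": 12},
                  {"fan_id": 2, "games_attended": 3}],
    "shop": [{"fan_id": 1, "orders": 4}],
}

one_view = build_one_view(sources)
for row in one_view.values():
    print(row)
```

A fan who appears in only one source simply lacks the other sources' columns, which corresponds to NULLs in a wide warehouse table.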
MLB’s editorial team is also benefiting from a new real-time reporting feature that streams data from Google’s Pub/Sub streaming data bus into BigQuery, where it’s automatically populated into Looker dashboards. They would do well to keep their eyes on San Diego’s 21-year-old phenom, Tatis Jr., who this week became the first player ever to hit 30 home runs and steal 20 bases in his first 100 games played.