How Walmart Uses Nvidia GPUs for Better Demand Forecasting
During a presentation at Nvidia’s GPU Technology Conference (GTC) this week, the director of data science for Walmart Labs shared how the company’s new GPU-based demand forecasting model achieved a 1.7-percentage-point increase in forecast accuracy compared to the existing approach.
The technology lab for the world’s largest company was pitted against an existing demand forecasting system that was developed by JDA Software. That system was no slouch, but Walmart’s internal developers say they have come up with a better approach to predict demand for 100,000 different products carried at each of the company’s 4,700 or so stores in the United States.
Walmart’s JDA system is currently responsible for crunching historical sales data on a weekly basis to come up with demand forecasts for roughly 500 million item-by-store combinations in the US, said Walmart Labs’ Distinguished Data Scientist and Director of Data Science John Bowman. “We forecast out a full 52-week horizon in weekly increments, and we generate a new set of forecasts every week, over the weekend,” he said.
Crunching all that data is computationally challenging, but the rewards – more product on the shelves, fewer out-of-stock situations, and increased sales – are worth the cost. “If you didn’t have some sort of capacity constraint, we wouldn’t really care,” Bowman said. “We do have severe pipeline capacity constraints. We have about a 12-hour window to perform all of our forecasting tasks, and about three days to perform all of the training tasks.”
The JDA system uses an exponential smoothing approach to forecast US-wide sales on a weekly basis, which are pushed down to individual stores using “internal Walmart magic,” Bowman said. “It actually works quite well,” he said. “It beat three out of four external vendor solutions when we were doing a request for proposal a few years ago for a new forecasting system. But of course, we won the contract internally because our forecasting approach was even better.”
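Bowman did not say which variant of exponential smoothing the JDA system uses, or how it is tuned. As a rough illustration only, the simplest textbook form keeps a running "level" that blends each new observation with the previous level. The function name, the smoothing factor, and the sales figures below are all made up for this sketch.

```python
def exponential_smoothing(sales, alpha):
    """Simple exponential smoothing: return the smoothed level after each week.

    Textbook form only; the variant used in the JDA system is not described
    in the talk, so this is an assumption for illustration.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = sales[0]  # initialize the level with the first observation
    levels = [level]
    for observed in sales[1:]:
        # New level = weighted blend of the latest observation and the old level.
        level = alpha * observed + (1.0 - alpha) * level
        levels.append(level)
    return levels


weekly_sales = [100.0, 120.0, 80.0, 110.0]  # hypothetical weekly unit sales
levels = exponential_smoothing(weekly_sales, alpha=0.5)
next_week_forecast = levels[-1]  # the last level is the one-step-ahead forecast
```

A higher smoothing factor makes the forecast react faster to recent weeks, at the cost of chasing noise.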
In addition to being the biggest company in the world, Walmart is also the biggest grocer in the world. These factors expose the company to all sorts of irregularities stemming from people’s personal shopping and eating preferences, and they also stress the accuracy of the forecasting system.
For example, during the Romaine lettuce recall last Thanksgiving, sales of Romaine lettuce plummeted as the supply dried up and the nation abruptly stopped eating the popular green. At the same time, Bowman noted that sales of chayote squash skyrocketed in the New Orleans, Louisiana area, which he learned was driven by peculiarities of Cajun cuisine. A similar spike can be seen around the New Year with cabbage, which Bowman says is a carryover from Western European tradition.
When Hurricane Harvey hit Houston, Texas in 2017, it severely impacted sales at 40 to 50 of the Walmart stores in the area. If no adjustments were made, either in the data or the algorithm, the product forecast would (incorrectly) be up to 10% lower for those stores going forward, Bowman said.
“This is the sort of thing you see over and over again,” he said. “Something funny will happen to your data generating process–in our case external to the company–and you’ll get these sorts of outliers, US-wide in this case, that get into your data. When this sort of thing happens, you wind up with bad forecasts going forward, unless your algorithms are insensitive to it, so we use the term robust for that.”
The challenge for data scientists like Bowman is that creating robust machine learning models is not easy. “Most machine learning algorithms are not actually very robust to having bad data in them,” he said. “You can address that in any of three ways.”
The first (and worst) approach is that you hope and pray your data is better than what your competitors have, which it probably won’t be, and then rely on human beings to cope with the bad data. The second approach is to work hard to cleanse the data using “history-cleansing algorithms,” to ensure that pristine data is fed into the non-robust machine learning models.
The third (and best) approach is to take the time to build robust machine learning models that can withstand the occasionally erratic data that will inevitably find its way into your system, Bowman said. The tradeoff is that this approach is slow. “Robust algorithms tend to be extremely computationally intense,” Bowman said. “They’ll take thousands of times longer to run.”
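To see what "robust" means in practice, consider the most basic case: the mean of a sales history moves sharply when one bad week enters the data, while the median barely moves. This is a toy illustration with made-up numbers, not one of Walmart's algorithms, and the median is cheap to compute in a way that the robust models Bowman describes are not.

```python
from statistics import mean, median

# Hypothetical weekly unit sales for one item at one store.
normal_weeks = [100, 102, 98, 101, 99]
# Same history plus one week in which the store was closed (made-up outlier).
with_outlier = normal_weeks + [5]

# A non-robust estimate (the mean) is dragged down by the single bad week.
mean_shift = mean(normal_weeks) - mean(with_outlier)
# A robust estimate (the median) is almost unaffected.
median_shift = median(normal_weeks) - median(with_outlier)

print(f"mean shifts by {mean_shift:.2f}, median shifts by {median_shift:.2f}")
```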
From Spark to CUDA
Bowman’s first approach to redesigning the forecasting system was based on Apache Spark. The company used Scala to design a machine learning-based model that used about 400 different data features to generate forecasts for those 500 million store-items on a weekly basis.
However, the Spark experiment didn’t end well. “As we scaled our algorithm and used more and more of our data and encompassing more and more categories of items, we started to run into some severe problems,” Bowman said.
The Spark cluster would run just fine one time around, and then generate “garbage” the next time, Bowman said. After restarting the job, it would run fine, and then suddenly it would crash.
“So when this sort of thing happens, naturally we suspect some sort of memory leak somewhere. But we couldn’t find it and we couldn’t figure out a way of working around it,” he said. “We spent over six weeks attempting to debug the code, to restructure the code, and at the end of the six-plus weeks, we were unable to complete any feature engineering processes at all.”
During that time, the production Spark machines were useless, he said, although the existing JDA system was still generating forecasts. “We could not generate any forecasts,” Bowman said. “And of course, our users were not too happy with us. So we revised our feature engineering pipeline in a rather hurried manner, as one might expect.”
Speed to Rapids
The solution involved utilizing Nvidia’s Rapids software. The group rewrote the Spark code in a combination of R and C++, with a bit of CUDA so that it could run on a cluster composed of 14 Supermicro servers, each of which is outfitted with four Nvidia P100 GPUs.
“It turned out that when we were done with our frantic two weeks of coding, the performance of our R, C++, with a little extra CUDA was essentially the same as what we had gotten with the Spark cluster, except for the fact that it actually ran to completion,” Bowman said.
The main forecasting model is composed of a series of machine learning algorithms that are run together in an ensemble manner. This includes an internally developed state-space algorithm that was written in TensorFlow and borrowed from the e-commerce department, as well as a gradient boosting machine that was based largely on the XGBoost code included with Nvidia’s Rapids product.
The GBM generates forecasts for every product category using about 350 data features. Those include historical sales information, including sales from the last few weeks and sales from the same period last year. They also include event and promotion features, such as the Super Bowl, and even Supplemental Nutrition Assistance Program (SNAP) data from the Federal Government. With so many features and the need to run so many iterations, the GBM is very computationally intense.
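The historical sales features mentioned here are commonly built as "lags" of the weekly sales series. The sketch below shows one plausible way to do that; the function name, the feature names, and the choice of three recent lags are assumptions for illustration, since the talk did not describe Walmart's actual feature engineering code.

```python
def lag_features(weekly_sales, week, recent_lags=(1, 2, 3), yearly_lag=52):
    """Build lag features for the week at index `week`.

    Hypothetical helper: feature names and lag choices are illustrative only.
    """
    if week < yearly_lag:
        raise ValueError("need at least a year of history before `week`")
    # Sales from the last few weeks.
    features = {f"sales_lag_{k}": weekly_sales[week - k] for k in recent_lags}
    # Sales from the same week one year earlier.
    features["sales_same_week_last_year"] = weekly_sales[week - yearly_lag]
    return features


history = list(range(60))  # made-up series in which week i sold i units
row = lag_features(history, week=55)
```

Each such row, extended with event and promotion indicators, would then be one training example for the gradient boosting machine.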
The state-space model is based on a time-series algorithm that attempts to detect things like seasonal turns, Bowman said. The core algorithm at play here is the Kalman filter, which saw a significant speedup on the GPU. “The main workhorse of our forecasting algorithm is the state space model, because it’s fast and for longer horizons, forecast accuracy is terribly important, so we use it for that,” Bowman said.
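For readers unfamiliar with the Kalman filter, the sketch below runs it on the simplest state-space model, a "local level" in which underlying demand follows a random walk and each observation adds noise. It is a scalar, pure-Python textbook version with made-up variances. Walmart's model is written in TensorFlow and handles seasonality, none of which is reproduced here.

```python
def local_level_kalman(observations, process_var, obs_var,
                       init_mean=0.0, init_var=1e6):
    """Kalman filter for a local-level model; returns the filtered level means.

    Textbook sketch under assumed noise variances, not Walmart's model.
    """
    level_mean, level_var = init_mean, init_var
    filtered = []
    for observed in observations:
        # Predict: the level follows a random walk, so only uncertainty grows.
        level_var = level_var + process_var
        # Update: the Kalman gain sets how much to trust the new observation.
        gain = level_var / (level_var + obs_var)
        level_mean = level_mean + gain * (observed - level_mean)
        level_var = (1.0 - gain) * level_var
        filtered.append(level_mean)
    return filtered


steady = local_level_kalman([10.0] * 20, process_var=1.0, obs_var=1.0)
step = local_level_kalman([10.0, 12.0], process_var=1.0, obs_var=1.0)
```

Because the level is a random walk, the forecast for any future week is simply the last filtered level, which is part of why this family of models is fast over long horizons.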
According to Bowman, the GBM model runs about 25x faster on the GPU than on the CPUs it ran on previously. And because the company is able to run four parallelized versions at once (since it has four GPUs per server), it’s the equivalent of a 100x speedup per server.
“So this is nice,” Bowman said. “It saves us the effort of having to buy another 1,386 Supermicro boxes to get the same processing power out of it. And we’re running on three generation-old cards at the moment. If we were using the latest DGX workstations, that 25x would be substantially greater.”
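The 1,386 figure follows directly from the numbers quoted in the talk, assuming the baseline is one CPU-only server per Supermicro box and that work scales perfectly across GPUs and servers (both simplifying assumptions):

```python
gpu_speedup = 25      # GBM speedup per GPU versus the earlier CPU run
gpus_per_server = 4   # Nvidia P100 cards in each Supermicro server
gpu_servers = 14      # servers in the GPU cluster

# Four GPUs running in parallel give each server about 100x the throughput.
per_server_equivalent = gpu_speedup * gpus_per_server
# CPU-only servers needed to match the whole GPU cluster.
cpu_servers_needed = gpu_servers * per_server_equivalent
# Walmart already has 14 boxes, so the rest would have to be bought.
additional_boxes = cpu_servers_needed - gpu_servers

print(per_server_equivalent, cpu_servers_needed, additional_boxes)
```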
The GPU-based system is currently handling about 20% of the forecasting work, while the JDA system handles the other 80%. Bowman said the plan calls for the GPU system to handle 100% of the work by the end of the year.
Embracing the GPU
Walmart Labs has lots of other algorithms in the works, including a random forest approach that could help deliver even better forecasts. However, the company’s original random forest algorithm was written with scikit-learn, which doesn’t run on a GPU, so Walmart is not using that algorithm in production. But Walmart Labs is working with Nvidia to develop a version of the random forest algorithm that does leverage the power of GPUs.
Bowman said that one could expect to see a 30x to 50x performance improvement from running the random forest algorithm on GPUs, so “it’s something you’d really like to do,” he said. “Nvidia has been working with us on that. We’ll see where that entire project goes.”
Today the company uses an ensemble approach that mixes the GBM and state-space models using a weighted average. “In the great bulk of cases, you can be a lot more sophisticated about your ensembling approach, as those of you who pay attention to Kaggle competitions know,” he said. “But even something as simple as a weighted model significantly outperforms either of the two models [alone]. And we’re building out more sophisticated versions of the ensembles, where we’re incorporating features that have been fed into the GBM and so on and so forth and also looking at incorporating the random forest into it as well.”
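A weighted-average ensemble of two forecasts is simple enough to sketch in a few lines. The weight and the forecast values below are made up, since Bowman did not say what weights Walmart uses or how it chooses them.

```python
def weighted_ensemble(gbm_forecast, state_space_forecast, gbm_weight):
    """Blend two forecasts week by week using a fixed weight.

    Illustrative sketch: the weight used in production is not given in the talk.
    """
    if not 0.0 <= gbm_weight <= 1.0:
        raise ValueError("gbm_weight must be between 0 and 1")
    if len(gbm_forecast) != len(state_space_forecast):
        raise ValueError("forecasts must cover the same weeks")
    return [
        gbm_weight * gbm + (1.0 - gbm_weight) * state_space
        for gbm, state_space in zip(gbm_forecast, state_space_forecast)
    ]


# Hypothetical two-week forecasts from each model.
blended = weighted_ensemble([100.0, 200.0], [120.0, 160.0], gbm_weight=0.25)
```

In practice the weight would typically be tuned on held-out weeks, and it could differ by forecast horizon.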
In the final analysis, the adoption of GPU processing has given Walmart a massive speedup in its ability to generate demand forecasts for all of its stores. But that’s not actually where the big benefit comes from all of this, Bowman said.
“From the point of view of the business, the advantage is that we have the ability to use more sophisticated, more complicated algorithms than we would otherwise be able to use at all,” he said. “And if you look at our suite of forecasting algorithms and say what is it that we would be able to achieve if we could not use XGBoost because it would run just too slowly [and] what is it we would be able to achieve if we could not do this other stuff? We see that it would cost us about 1.7 percentage points in forecast accuracy.”
Walmart sells about $330 billion in goods each year. While one can’t simply multiply that number by 1.017 to gauge the net economic benefit that the adoption of GPU computing has brought to the company, the evidence suggests the new approach will have a benefit measured in the billions of dollars when fully implemented by the end of 2019.