US Census Provides Big Data Grist for GPU Analytics Mill
As the United States Census Bureau begins to release data from its 2020 tally, companies are gearing up to analyze the new information and figure out how they can leverage it for competitive advantage. For large corporations that take a national view, the processing power of a GPU can be instrumental in accelerating the pace of discovery, particularly when mixing Census data with other large datasets.
One of the analytic firms that’s eager to help clients dig into the decennial survey of the American populace is OmniSci. The San Francisco company’s analytics and machine learning platform is powered by a relational database designed to run atop a GPU, which speeds up interactive data discovery in addition to training machine learning models.
The 2020 Census is a perfect dataset to exploit the power of OmniSci’s platform, according to Mike Flaxman, who is the company’s data science practice lead as well as its geospatial product manager.
“Once you get to the Census block level and you’re doing national analyses, things get slow in traditional CPU-based analysis tools,” Flaxman says. “There are 200,000-ish Census block groups in the country. If you’re drawing a map of those by whatever variable you want, that can be a little painful using a traditional BI tool.”
According to Flaxman, the full Census data set comprises about 2,000 variables, which are provided as columns in a relational database. Since OmniSci’s analytic database is, by nature, a column store, it’s well-suited to processing that type of data, he says.
“The way we do things basically is we bring it into memory and process only the variables you’re interested in, and ignore the other ones,” Flaxman says. “So if you’re building models for site suitability and you have certain demographic criteria, maybe five or six of them, you’re pulling in five or six columns out of 2,000 potential columns.”
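The column-store idea Flaxman describes can be illustrated with a minimal sketch. This is not OmniSci's implementation or API; the function names, column names, and values below are all made up for illustration. It only shows why storing each variable as its own column lets a query touch a handful of columns and ignore the rest.

```python
from typing import Dict, List

# Hypothetical toy column store: each variable is stored as its own list.
ColumnStore = Dict[str, List[float]]


def load_columns(store: ColumnStore, wanted: List[str]) -> ColumnStore:
    """Return only the requested columns; the other columns are never read."""
    return {name: store[name] for name in wanted}


def suitable_rows(cols: ColumnStore, minimums: Dict[str, float]) -> List[int]:
    """Return indices of rows where every selected column meets its minimum."""
    n_rows = len(next(iter(cols.values())))
    return [
        i
        for i in range(n_rows)
        if all(cols[name][i] >= floor for name, floor in minimums.items())
    ]


# Made-up data: three block groups and four stand-ins for the ~2,000 variables.
census: ColumnStore = {
    "median_income": [52000.0, 88000.0, 41000.0],
    "households": [410.0, 620.0, 380.0],
    "median_age": [34.0, 41.0, 29.0],
    "vehicles_per_household": [1.4, 2.1, 0.9],
}

# Site-suitability criteria that reference only two of the columns.
criteria = {"median_income": 50000.0, "households": 400.0}
selected = load_columns(census, list(criteria))
matches = suitable_rows(selected, criteria)
print(matches)
```

In a row-oriented layout, the same query would have to read every variable of every record just to inspect two of them, which is the cost a column store avoids.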
The GPU backend gives OmniSci the power to crunch large volumes of data, such as block-level US Census data, and render the results on the screen in something close to real-time. The rule of thumb for GPU versus CPU analytics is that it would take a cluster with 100 CPU nodes to equal the performance of a single GPU system, Flaxman says.
But in addition to the sheer data volume, OmniSci also gives the analyst or data scientist the flexibility to quickly add or subtract other data sets, without waiting around for a cluster to rebalance the data, which can be fairly time-consuming when lots of interconnected nodes are involved.
“It’s both the initial size of the data, but it’s also the fluid ability to grab five things out of 2,000, then change your mind and grab a sixth thing. How long does it take to do that?” Flaxman says. “At a certain horizontal scaling, you spend more of your time moving data around than actually computing on it, so what a GPU enables you to do is have millisecond analysis for all that stuff.”
OmniSci’s clients are eager to get their hands on the US Census data primarily because of the demographic data that it contains. While companies have many sources of data about their customers and potential customers, nothing beats the US Census for establishing that “ground truth” for geographically based demographic data, such as age, race, ethnicity, gender, income, number of children, number of vehicles, employment status, military status, access to Internet, etc.
“It’s the gold standard on which we hang a lot of other stuff,” says Flaxman, who has a PhD in landscape planning from Harvard University and previously worked at Esri, the leader in geographic information systems (GIS). “The Census essentially does an excellent job of where you sleep at night. It doesn’t do much for commercial and retail, but businesses are interested in both.”
Companies are turning to OmniSci to help them slice and dice this demographic data, which helps inform decisions such as where to build new stores, where to place 5G cell towers, or how to price insurance.
“The main thing that people are looking at with the 2020 stuff is demographic shifts,” Flaxman tells Datanami. “The fact that people are moving around in the nation is core to all their businesses.”
2020 was unique in a lot of ways, including the fact that the US was suffering from the COVID-19 pandemic and the ensuing economic shutdowns. Huge swaths of retail and commercial real estate went dark, and business leaders are looking for clues as to how these segments will play out going forward.
“It seems like America needs a lot less retail and commercial space than it used to, and we need housing in a different pattern than we’re used to,” Flaxman says. “A lot of urban centers have that pattern, because people are having fewer children, taking longer before they have children, and staying in urban active zones longer than they used to historically.”
The country appears to be on the verge of seismic shifts in land-use planning, with large-scale changes in how we live, work, and play, according to Flaxman. The success of work-from-home, the boom in house values, the housing affordability crisis, and the continued out-migration from expensive coastal zones all combine to present a rich tapestry of demographic data describing the current situation.
The Census data, as well as the American Community Survey that the Census Bureau updates annually, can help inform that decision-making, not only at the local planning level, but for the millions of businesses that want to be ready to capitalize on the shifting demographics of the American populace at the national level, Flaxman says.
“We can interleave our work and living environments more closely than we did back in the day when we were a big industrial power. It was important to separate people from smokestacks. There are good reasons for doing that,” he says. “What happens to these former retail sites or former commercial sites, probably will be a lot like we saw with the de-industrialization” of American cities.
OmniSci’s customers will mix the Census data with other data, including data from smart phones and retail systems, to get a better picture of how Americans’ work and play habits are changing. While these datasets can provide compelling insight into where Americans go and what they do during the day, the Census data provides that baseline demographic data based on where people live that really can’t be obtained any other way.
“We’re blending mobility data with Census data constantly, and retail point of presence as well,” Flaxman says. “But everybody relies on the Census to give us the background on the 100% count that takes massive resources to actually accomplish.”
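As a rough illustration of what blending mobility data with a Census baseline can look like, the sketch below joins made-up visit counts to made-up population counts on a shared block group identifier. The IDs, figures, and function name are all hypothetical, and the approach is a plain in-memory join rather than anything specific to OmniSci.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up Census baseline: resident population keyed by a hypothetical block group ID.
census_population: Dict[str, int] = {"bg_001": 1200, "bg_002": 800, "bg_003": 1500}

# Made-up mobility observations: (home block group of the visitor, store visits).
mobility_visits: List[Tuple[str, int]] = [
    ("bg_001", 30),
    ("bg_002", 8),
    ("bg_001", 18),
    ("bg_003", 15),
]


def visits_per_1000_residents(
    population: Dict[str, int], visits: List[Tuple[str, int]]
) -> Dict[str, float]:
    """Normalize observed visits by the Census population of each block group."""
    totals: Dict[str, int] = defaultdict(int)
    for block_group, count in visits:
        totals[block_group] += count
    return {
        bg: 1000 * totals[bg] / pop
        for bg, pop in population.items()
        if pop > 0  # skip unpopulated block groups to avoid dividing by zero
    }


rates = visits_per_1000_residents(census_population, mobility_visits)
print(rates)
```

The raw visit counts alone would rank bg_003 above bg_002, but once the Census population serves as the denominator the two come out equal, which is the kind of baseline correction the 100% count makes possible.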