January 9, 2018

The Big Data Tech Inside the 2020 Census


The US Census Bureau is adopting the latest data processing technology to help with its upcoming 2020 Census, including the use of a large Hadoop cluster, real-time stream data processing, and advanced mapping and visualization products.

While the 2020 Census won’t be entirely paperless, it will be the first national census that’s conducted predominantly by electronic means. The US Census Bureau says it will be the first census “with a full Internet option,” as well as the first to use electronic devices instead of paper to manage and conduct fieldwork.

The bureau is currently ramping up its technological prowess to ensure this massive digital transformation project goes smoothly. A key part of that investment is a contract it signed last year with Hortonworks to provide the underlying data management layer.

According to Shaun Bierweiler, vice president of the U.S. public sector business at Hortonworks, the deal will span all of the company’s offerings, including the Hortonworks Data Platform (HDP) Hadoop distribution and its Hortonworks DataFlow (HDF) stream processing system.

Scalability was a big reason Hortonworks won the contract, Bierweiler says. “When you think about the approximately 326 million Americans that the Census Bureau is going to collect and store data on,” Bierweiler tells Datanami, “you need a data platform that’s going to not just perform, but really operate at that industrial scale.”

The decennial census is used to collect a variety of data about life in the United States

HDP will form the main data lake (what the Census Bureau calls the Census Data Lake) that stores the lion’s share of the census data. It will also function as a staging ground for joining data from other databases, including data from other agencies that is used to remotely identify vacant housing units without sending field personnel to inspect them. The cluster is expected to store both structured data (including names, addresses, and individuals’ answers to demographic questions) and unstructured data, such as pictures taken from Google Maps or aerial imagery.
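
To make the idea of joining outside records against an address list concrete, here is a minimal Python sketch. It is purely illustrative and is not the Census Bureau's method: the `HousingUnit` type, the `flag_likely_vacant` function, the `min_signals` threshold, the source names, and all of the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HousingUnit:
    unit_id: str   # hypothetical identifier for an address-list entry
    address: str


def flag_likely_vacant(units, external_records, min_signals=2):
    """Return IDs of units that at least `min_signals` distinct sources report as vacant."""
    vacancy_signals = {}
    for record in external_records:
        if record["reported_vacant"]:
            # Collect the distinct sources that agree a unit is vacant.
            vacancy_signals.setdefault(record["unit_id"], set()).add(record["source"])
    return [
        unit.unit_id
        for unit in units
        if len(vacancy_signals.get(unit.unit_id, ())) >= min_signals
    ]


# Made-up address list and made-up records from two hypothetical outside sources.
units = [
    HousingUnit("U1", "10 Main St"),
    HousingUnit("U2", "12 Main St"),
    HousingUnit("U3", "14 Main St"),
]
external_records = [
    {"unit_id": "U1", "source": "agency_a", "reported_vacant": True},
    {"unit_id": "U1", "source": "agency_b", "reported_vacant": True},
    {"unit_id": "U2", "source": "agency_a", "reported_vacant": True},
    {"unit_id": "U3", "source": "agency_b", "reported_vacant": False},
]

likely_vacant = flag_likely_vacant(units, external_records)
print(likely_vacant)  # ['U1']
```

Requiring agreement from more than one source is just one plausible design choice; units that fall below the threshold would still need a field visit.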

The Census Bureau says it plans to use extensive aerial and street-level imagery during the 2020 Census, both on the front-end (data collection) and back-end (data analysis) stages of the project. On the front-end, visualization will help streamline address identification, thereby minimizing the number of workers going door to door. And when workers do hit the streets, they’ll gather the data via mobile devices equipped with GIS-based navigation and routing, as explained in an article from GIS software provider Esri.

“We’re using the technology in a much greater way than we ever have before,” Tim Trainor, the US Census Bureau’s chief geospatial scientist, tells Esri. “There are pockets of [GIS] development going on all over that are eventually feeding into the big picture.”

The Census Bureau has been debating how to approach the 2020 Census for some time. Back in 2013, the agency pondered whether it should dive into the “big data” phenomenon. “Can Big Data reliably supply the social, demographic, health behavior, and business activity information required for a 21st-century society? Our current answer to this question is, ‘Not yet,’” officials with the Census Bureau wrote five years ago.

The agency has clearly rethought that initial assessment, and is eagerly welcoming new data types into the fold. This is smart thinking, according to Bierweiler, who says the combination of new sources of data and advances in data processing will enable the Census Bureau to do things that were not previously possible.

“Once you have a foundation that supports [marrying structured and unstructured data], really the opportunities are endless,” he says. “They have the flexibility of no longer being confined to name, address, and Zip Code.”

Hadoop provides the right combination of scalability and flexibility to enable the Census Bureau to tackle the new data processing tasks it’s set out for itself, Bierweiler says. Trying to accomplish these goals with a traditional relational database would probably not end so well, he says.

“If all you’re storing is structured data, then perhaps a relational database is the right fit,” he says. “But when you look at the digital means for which they’re going to do with Census 2020 and the marrying of structured and unstructured data….that’s no longer going to fit nicely into a relational database.”

The Census Bureau is expected to use HDF to help funnel the flow of data gathered from census fieldworkers into the main Census Data Lake. It could also provide the Census Bureau with a metadata-based lineage that allows it to trace a questionable piece of data back to its source, Bierweiler says.
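
The lineage idea can be illustrated with a short Python sketch in which every derived record keeps a pointer to the record it came from, so a questionable value can be walked back to its origin. This is an assumption-laden toy and not how HDF implements provenance: the `Record` and `LineageLog` names, the step names, and the sample payloads are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    record_id: str
    payload: dict
    produced_by: str           # name of the processing step that emitted this record
    parent_id: Optional[str]   # record it was derived from; None for raw input


class LineageLog:
    """Toy in-memory log that remembers which record each record was derived from."""

    def __init__(self):
        self._records = {}

    def emit(self, record_id, payload, produced_by, parent_id=None):
        record = Record(record_id, payload, produced_by, parent_id)
        self._records[record_id] = record
        return record

    def trace(self, record_id):
        """Follow parent links back to the raw source; newest step first."""
        steps = []
        current = record_id
        while current is not None:
            record = self._records[current]
            steps.append(record.produced_by)
            current = record.parent_id
        return steps


# Hypothetical three-step flow from a field device to the data lake.
log = LineageLog()
log.emit("r1", {"address": "10 main st", "household_size": "3"}, "field_device_upload")
log.emit("r2", {"address": "10 Main St", "household_size": 3}, "normalize", parent_id="r1")
log.emit("r3", {"tract": "0001", "household_size": 3}, "load_to_data_lake", parent_id="r2")

path = log.trace("r3")
print(path)  # ['load_to_data_lake', 'normalize', 'field_device_upload']
```

A production system would persist this metadata and record timestamps and content changes at each hop; the sketch only shows why keeping a parent link per record is enough to reconstruct the path.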

The Census Bureau has committed to providing the general public with more data, and data with a higher granularity of detail, as part of the 2020 Census project. Much of the processing required to hammer the raw data into usable information is slated to occur on the Hortonworks platform, likely through one of the many SQL-based query engines that run on the Hadoop distribution. The results will be shared via the agency’s website.

In addition to using new big data and digital tech to gather statistics that describe life in America, the Census Bureau is also embracing new big data-powered marketing techniques to improve its outreach efforts ahead of and during the actual census.

The IT work for the 2020 Census has already begun. The plan calls for the HDP foundation to be “locked down” over the next year or so, while 2019 is dedicated to testing and evaluation, Bierweiler says. In addition to supplying software, Hortonworks is providing a few technical staff members to assist the government’s technology team and the systems integrators working on the project.

“We are a very small part of a much bigger team,” he says, “but we’re there to help ensure that our data platform is able to make this data as available as expected as needed to really help convert the raw data sources into actionable intelligence.”

Hortonworks would not comment on the value of the deal. However, a recent story in the Motley Fool reports that the Santa Clara, California company won an $8.1 million contract from the Department of Commerce, the Census Bureau’s parent agency. That would make the 2020 Census contract the biggest deal for Hortonworks over its short six-year lifetime, according to the story.

The overall 2020 Census is expected to cost more than $15 billion. When completed, the population count will be used to apportion seats in the House of Representatives, and by states to redraw congressional district boundaries.
