November 13, 2013

The ABC’s of the USPS Big Data System

Isaac Lopez

Volume and velocity are significant challenges for the United States Postal Service (USPS), which, in addition to sorting approximately 160 billion pieces of mail across its 275 routing centers around the country, also needs to perform sophisticated fraud-detection scanning at rates of up to 4 billion items per day.

Recently, FedCentric Technologies announced that it was awarded a $16.7 million contract to expand on work that it’s been doing for the USPS in building a real-time system for these purposes. This week, Datanami has learned more details of the system that the company is working on, which according to Gerry Kolosvary, President of FedCentric Technologies, is built using the SGI UV 2000 system.

“What [the USPS] has tried to do in the past is use traditional approaches,” said Kolosvary, explaining USPS’s odyssey in getting a grip on its big data problem. These traditional approaches, he explained, have been x86-based and proprietary Sun systems, on which they built either scale-up or scale-out platforms that ultimately didn’t work.

“It’s the vast volume,” said Kolosvary, explaining why neither architecture worked for them. USPS was processing 2,500 scans per second and the systems were bottlenecking at the network level, he said. The USPS needed to find a different approach.

Enter FedCentric Technologies, which Kolosvary categorizes as a federal systems integrator that specializes in big data. He says that, while everyone is focused on the “V’s” of big data, FedCentric has come up with its own process-focused alliteration around the ABC’s of big data. (Note: while as a journalist this sort of thing can be annoying, it’s also instructive when you consider the orders-of-magnitude increase in performance the approach delivers, which we’ll talk about shortly.)

The architectural difference starts with what Kolosvary refers to as “Affinity.” “Affinity is about how close your data is to each other,” he explains. “How easy is it for you to make a connection and to see something that isn’t apparent, but becomes apparent once you make that connection? The relative closeness of the data becomes very important.”
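
To make the idea concrete, here is a minimal sketch (in Python) of the kind of connection affinity enables. The dictionary index and the “same item scanned at two distant facilities within an hour” rule below are our own illustrative assumptions, not USPS’s actual fraud logic; the point is simply that when all of the scan history sits in one memory space, making the connection is a local lookup rather than a distributed query.

    from collections import defaultdict

    # In-memory index: barcode -> list of (facility_id, timestamp_seconds).
    scans_by_barcode = defaultdict(list)

    def ingest_scan(barcode, facility_id, ts, max_plausible_gap=3600):
        """Record a scan; return earlier scans it conflicts with (hypothetical rule)."""
        suspicious = [
            (prev_facility, prev_ts)
            for prev_facility, prev_ts in scans_by_barcode[barcode]
            # Illustrative signal: one item cannot appear at two facilities within an hour.
            if prev_facility != facility_id and abs(ts - prev_ts) < max_plausible_gap
        ]
        scans_by_barcode[barcode].append((facility_id, ts))
        return suspicious

    # A second scan of "A123" arrives ten minutes later from a different facility.
    ingest_scan("A123", facility_id=17, ts=1000)
    print(ingest_scan("A123", facility_id=204, ts=1600))  # -> [(17, 1000)]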

“Boundaries” is the next process issue, explained Kolosvary, who added that in the system where the USPS was using a scale-out approach, the boundary was really at the blade level. “The way I define boundary is how much of the time does your data and your application spend on a CPU and in-memory,” he explained. “The boundaries of the scale-out system are defined by the blade.”

“Connectivity” becomes the next important issue to consider: once you cross the boundary, you’re out on the network. “At that point, the network becomes an integral part of the compute process, and at that point you’re really going only as fast as the network can go. The network becomes your weakest link,” he said.

Finally, Kolosvary explained that they’ve added the letter D for “Domains,” referring to the architectural sections of the overall system. “How much time does your data spend in the compute domain vs. when it crosses the boundary into the network or the I/O domain?”
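
One rough way to picture the boundary, connectivity, and domain issues together is to count how often a lookup has to leave its blade. In the toy model below (the blade count and hash partitioning are arbitrary assumptions, not the USPS design), a scale-out cluster spreads the scan history across nodes, so most lookups cross the network boundary, while on a single shared-memory system the same lookups never leave the compute domain.

    NUM_NODES = 64  # hypothetical blade count for the scale-out cluster

    def node_for(barcode):
        """Hash partitioning: which blade owns this barcode's scan history."""
        return hash(barcode) % NUM_NODES

    def lookups_crossing_boundary(barcodes, local_node):
        """Count the lookups this blade would have to send over the network."""
        return sum(1 for b in barcodes if node_for(b) != local_node)

    batch = [f"PKG{i:07d}" for i in range(10_000)]
    remote = lookups_crossing_boundary(batch, local_node=0)
    print(f"scale-out: {remote} of {len(batch)} lookups leave the blade")
    print("scale-up : 0 lookups leave the boundary (all data in one shared memory)")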

Using this framework, Kolosvary said that FedCentric Technologies developed an approach that it put into place for the USPS, leveraging the SGI UV 2000 supercomputer system. “The boundaries of this approach are not at the blade level,” he said. “They’re at the system level. So where I’m limited on an x86 scale-out system to 40 or 48 cores and 1.5 terabytes, with the SGI system I can grow and scale that to 4,096 cores and 16 terabytes of RAM in a boundary before I have to do anything – before I have to go to a network, and before I have to go to disk. That means my affinity can really be close.”

“In fact,” he continued, “I never have to leave memory, so it’s only a couple of hops away using a computer backplane, not a network. So the connectivity is tightly coupled and so we can make these data connections very quickly on the system because there aren’t any boundaries… as far as domains go, we stay in the compute domain almost all the time.”
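
A quick back-of-the-envelope comparison, using only the figures Kolosvary cites, shows how much more capacity fits inside a single boundary before anything has to cross a network or touch disk:

    # Capacity inside one "boundary," per the figures quoted above.
    scale_out = {"cores": 48,   "ram_tb": 1.5}   # top end of the x86 blade figures
    scale_up  = {"cores": 4096, "ram_tb": 16.0}  # SGI UV 2000 figures

    print(f"cores per boundary: {scale_up['cores'] / scale_out['cores']:.0f}x larger")   # ~85x
    print(f"RAM per boundary:   {scale_up['ram_tb'] / scale_out['ram_tb']:.1f}x larger") # ~10.7x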

The results of this approach, he says, have been dramatic. “Using our approach, they are able to get about 3.5 million scans per second – several orders of magnitude increase in performance.” That is roughly a 1,400-fold jump from the 2,500 scans per second noted earlier.
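
For the record, here is the arithmetic behind that figure, using only numbers quoted in this article:

    old_rate = 2_500             # scans/second on the bottlenecked systems
    new_rate = 3_500_000         # scans/second reported on the SGI-based system
    daily_items = 4_000_000_000  # fraud-scan volume cited at the top of the article

    print(f"speedup: {new_rate / old_rate:.0f}x (about three orders of magnitude)")      # 1400x
    print(f"average rate for 4 billion items/day: {daily_items / 86_400:,.0f} scans/s")  # ~46,296

That rate also clears the average of about 46,000 scans per second implied by 4 billion items per day, with considerable headroom.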

The system is working, and FedCentric has been rewarded for the success with a contract to expand the system by building four more of the supercomputer-based systems at the USPS’s Eagan, Minn., facility. And we, of course, get another example of how big data (and big compute) are affecting our lives in hidden ways.

Related items:

Postal Service Gets $16.7 Million Supercomputing Upgrade 

Dat Wants to be the GitHub for Data 

FoundationDB Gets $17M to Push ACID Machines 
