August 2, 2012

The Big Data Behind Biometric Identity

Datanami Staff

It is difficult enough to accurately represent the population of the United States in a census. The country has some 300 million people, many of whom move around and some of whom are intentionally hard to find. Harder still is mandating identification cards for every adult citizen and storing all the resulting data.

India has about four times the population of the US. A quarter of that population migrates every year, the equivalent of the entire United States on the move annually, and 40 percent of it survives on two dollars a day or less, leaving national identification far down the list of personal priorities.

As a result, a national identification project is not only a massive undertaking for the government's organizers but also a big challenge for those who must store and retrieve the resulting data. The challenge was grand enough that Dr. Pramod Varma and Regunath Balasubramanian were invited to The Fifth Elephant 2012 to explain how they are pulling off "Aadhaar," which is being called the "world's largest biometric identity project."

The goals of the project are relatively simple to understand. The first is to enroll the populace of India into the Aadhaar system, capturing each person's name, address, and multi-modal biometrics (in this case, fingerprint and iris scans) and assigning each a unique 12-digit identification number. The second is to use that information to verify a person's identity in the context of various official events.
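To make those two goals concrete, here is a minimal sketch of what an enrollment record and a one-to-one verification check might look like. The field names, the in-memory store, and the biometric_match stub are illustrative assumptions, not Aadhaar's actual schema or matching algorithm.

```python
from dataclasses import dataclass, field

def biometric_match(probe: bytes, template: bytes) -> bool:
    # Stand-in matcher: real systems score similarity between biometric
    # templates; exact byte equality here is only for illustration.
    return probe == template

@dataclass
class EnrollmentRecord:
    aadhaar_number: str                                # unique 12-digit number
    name: str
    address: str
    fingerprints: list = field(default_factory=list)  # fingerprint templates
    irises: list = field(default_factory=list)        # iris scan templates

def verify(store: dict, aadhaar_number: str, probe: bytes) -> bool:
    """One-to-one check: does the presented biometric match the record
    stored under this 12-digit number?"""
    record = store.get(aadhaar_number)
    if record is None:
        return False
    return any(biometric_match(probe, t)
               for t in record.fingerprints + record.irises)

store = {"123456789012": EnrollmentRecord(
    "123456789012", "Asha", "Bengaluru", fingerprints=[b"fp-1"])}
print(verify(store, "123456789012", b"fp-1"))   # True
print(verify(store, "123456789012", b"fp-2"))   # False
```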

As stated earlier, the inherent challenge lies in the sheer volume of data 1.2 billion people can create. The objective was to "enroll" a million people a day, taking four years to cover the entire population. Every time someone was newly enrolled in the database, the system had to check that his or her enrollment was not a duplicate. While it was acceptable for that check to take several days, the same was not true of the verification process, where the target transaction time was under a second.
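The asymmetry between those two operations is worth spelling out: de-duplication is a one-to-N search of the entire gallery and can run as a background batch over days, while verification is a one-to-one lookup keyed by the 12-digit number that must return in under a second. Below is a deliberately naive sketch of that split, reusing the same stand-in matcher as above; the queue-and-worker shape is an assumption for illustration, not Aadhaar's actual pipeline.

```python
import queue
import threading

def biometric_match(a: bytes, b: bytes) -> bool:
    return a == b  # same stand-in matcher as in the earlier sketch

gallery = {}                      # aadhaar_number -> list of templates
enrollment_queue = queue.Queue()  # new enrollments awaiting de-duplication

def dedup_worker():
    # One-to-N search across the whole gallery: with 1.2 billion records,
    # this is the step that may legitimately take days per batch.
    while True:
        number, templates = enrollment_queue.get()
        duplicate = any(
            biometric_match(t, e)
            for existing in gallery.values()
            for t in templates
            for e in existing
        )
        if not duplicate:
            gallery[number] = templates
        enrollment_queue.task_done()

threading.Thread(target=dedup_worker, daemon=True).start()
enrollment_queue.put(("123456789012", [b"fp-1"]))
enrollment_queue.put(("210987654321", [b"fp-1"]))  # same biometric: rejected
enrollment_queue.join()
print(sorted(gallery))  # only the first of the two duplicates was enrolled
```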

In pure numbers, Aadhaar was to handle 100+ million verifications a day within a ten-hour window, translating to four terabytes of audit log data every ten days, each byte of which had to be "guaranteed."
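A bit of back-of-the-envelope arithmetic shows what those figures imply, assuming the stated rates hold exactly:

```python
verifications_per_day = 100e6       # "100+ million verifications" per day
window_seconds = 10 * 3600          # served within a ten-hour window

rate = verifications_per_day / window_seconds
print(f"sustained rate: {rate:,.0f} verifications/second")    # ~2,778/s

audit_bytes = 4e12                          # four terabytes of audit log...
verifications = verifications_per_day * 10  # ...per ten days of traffic
print(f"audit trail: {audit_bytes / verifications:,.0f} bytes "
      "per verification")                                     # ~4,000 bytes
```

In other words, roughly 2,800 verifications a second sustained, each leaving about 4 KB of audit trail behind.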

Varma noted three key architecture principles behind the system: making sure every component could scale to big data volumes, avoiding being handcuffed to any particular vendor, and recognizing that everything fails at some point. Avoiding vendor lock-in meant keeping the project as open source as possible. Even in an open source environment, however, security and privacy were, for obvious reasons, among Aadhaar's highest priorities.

Scalability across all components was important to eliminating bottlenecks entirely. As stated above, there could be no situation in which a verification was requested and never answered. The multiple-vendor approach also served this end by building in redundancy. Again, the assumption that fueled these intricacies was that everything can and will fail at some point, and that building redundancy and reducing bottlenecks throughout the system would minimize the damage when it does.
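One concrete way to act on "everything fails" is to treat every dependency as one of several interchangeable backends and fail over between them, so a verification request is never left unanswered. The sketch below illustrates the idea; the backends are invented stand-ins, not Aadhaar components.

```python
import random

def call_with_failover(backends, request, attempts_per_backend=2):
    """Try each redundant backend in turn; a verification request should
    never go unanswered just because one component is down."""
    last_error = None
    for backend in backends:
        for _ in range(attempts_per_backend):
            try:
                return backend(request)
            except ConnectionError as err:
                last_error = err   # this node failed; retry, then move on
    raise RuntimeError("all redundant backends failed") from last_error

def flaky_store(request):          # fails half the time
    if random.random() < 0.5:
        raise ConnectionError("node down")
    return {"verified": True}

def reliable_store(request):       # the redundant fallback
    return {"verified": True}

print(call_with_failover([flaky_store, reliable_store],
                         {"id": "123456789012"}))
```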

Some examples of the multiple-vendor approach include using Mule, RabbitMQ, and GridGain for a Staged Event-Driven Architecture (SEDA), and using both MySQL and HBase for high-volume, moderate-latency data access. Indeed, different technologies were used for each of the data access needs, with Hadoop's HDFS and Hive handling the all-important high-throughput streaming largely responsible for identification verifications.
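SEDA splits a pipeline into stages connected by queues, so each stage can be scaled and throttled independently and a slow stage simply backs up its own queue rather than stalling everything upstream. Here is a toy in-process version of the idea; the real deployment used Mule, RabbitMQ, and GridGain as the messaging and processing fabric, not Python threads, and the two stage names here are invented.

```python
import queue
import threading

def stage(name, inbox, outbox, work):
    # Each SEDA stage pulls from its own queue, does one unit of work,
    # and hands off to the next queue; stages scale independently by
    # adding worker threads, and a slow stage just grows its backlog.
    def run():
        while True:
            item = inbox.get()
            result = work(item)
            if outbox is not None:
                outbox.put(result)
            inbox.task_done()
    threading.Thread(target=run, daemon=True, name=name).start()

incoming, checked = queue.Queue(), queue.Queue()
stage("quality-check", incoming, checked, lambda pkt: {**pkt, "ok": True})
stage("persist", checked, None, lambda pkt: print("stored packet", pkt["id"]))

for i in range(3):
    incoming.put({"id": i})
incoming.join()
checked.join()
```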

Varma also explored the NoSQL vs. SQL argument, or rather he dismissed it. According to Varma, relational databases have their purpose: he spoke of having lost a significant amount of data to NoSQL databases, with MySQL ultimately carrying the day. This is not a surprise, as Varma was essentially building one massive relational database, perhaps the largest in the world, whose purpose is to take any Indian citizen and relate them to people who are similar.

Indeed, one of the social functions Varma hoped the system would serve was to give the Indian population, especially those living on two dollars a day or less, a sense of how people similar to them, though not necessarily near them, are faring. That could hypothetically build a sense of community and satisfaction among the poor even if it does not build their income.

Either way, building a national identification system in the modern age, where everything is expected to exist digitally in some form or another, is a challenge for any country. In India, Varma and Balasubramanian have apparently met that challenge for one of the largest and most nomadic populations in the world.
