If all the data that exists in the world were represented as 2-hour high-definition movies, it would take a human 47 million years to watch all of them. Outlandish figures like this one emphasize the flood of data that is engulfing the world. Government agencies will soon have to figure out how to deal with it comprehensively.
A report from TechAmerica Foundation’s Federal Big Data Commission is designed to help government agencies identify the points of emphasis in the evolving big data world and to suggest a plan of action. The commission is headed by Steve Mills of IBM, Steve Lucas of SAP, and Michael Rappa of NC State, and includes representatives from industry-leading vendors such as Cloudera, NetApp, Amazon, and EMC, among others.
The report emphasizes and re-emphasizes two things: education as a future driver of big data advancement and the guiding of agencies toward viable solutions to deal with the data overload.
What is frequently lost in the national job creation debate is exactly which areas of the economy can spawn the greatest number of jobs. Currently, technology is one of the areas where the available workforce does not meet the industry’s demand, especially with regard to managing big data. The point is that there are plenty of data scientist jobs to be had if enough is invested in training and education.
Healthcare is a hot-button topic in this country due to its rising cost and apparent inefficiency. “Big data can help with that,” the report optimistically states. It should be no surprise that the digitization of health records has left the healthcare industry awash in data. According to the report, the industry produced 150 exabytes of data in 2009. It is safe to assume that number has increased significantly over the last three years.
The use cases for improving the efficiency of government agencies through big data are seemingly endless. From education to transportation to energy, big data can eventually be applied to pretty much anything. What is more interesting is exactly what is being done to transform the data into insight.
Although vendor representatives helped shape this report, the breadth of the commission’s membership should limit bias toward any single product. Its assessment of Hadoop is correspondingly candid: the report identifies Hadoop as a good research-oriented tool that leaves a little to be desired for real-time analysis and streaming. According to the report, “Hadoop is good for finding a needle in the haystack among data that may or may not be ‘core data’ for an agency today. Hadoop and Hadoop-like technologies tend to work with a batch paradigm that is good for many workloads, but is frequently not sufficient for streaming analysis or fast interactive queries.”
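To make the batch-versus-streaming distinction concrete, here is a minimal Python sketch. It does not use Hadoop or any real streaming engine, and it is not taken from the report; the function names and the sample records are hypothetical and chosen only for illustration. The batch function needs the complete dataset before it returns anything, while the streaming function yields an updated answer after every incoming record.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batch_count(records: List[str]) -> Counter:
    """Batch style: the whole dataset is processed in one pass, then one result is returned."""
    return Counter(word for line in records for word in line.split())


def streaming_count(records: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: keep a running tally and emit a snapshot after each record."""
    running: Counter = Counter()
    for line in records:
        running.update(line.split())
        yield running.copy()  # copy so earlier snapshots are not mutated later


# Hypothetical sample records, standing in for log lines or sensor messages.
records = ["alert sensor alert", "sensor ok", "alert"]

final = batch_count(records)                    # available only once all data is in
snapshots = list(streaming_count(records))      # one intermediate answer per record
```

The last streaming snapshot matches the batch result, but a decision that depends on the first record alone can be made from the first snapshot rather than waiting for the full run, which is the kind of workload the report says batch-oriented tools handle poorly.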
The report notes some “big data accelerators,” such as text extraction tools, that can help meet faster or more varied demands.
Interestingly, the report recommends treating each of the three V’s (volume, velocity, and variety) as a separate entry point. “Some initiatives do indeed leverage a combination of these entry points, but experience shows these are the exception.”
This means taking a divide-and-conquer approach within the agency to attack each individual use case. For example, a use case requiring real-time streaming and decision making would want to focus on velocity, while perhaps a more research-intensive, time-independent use case can focus on greater variety or volume.
The report is lengthy, but it succeeds in providing the government with some viable big data guidelines going forward.