Big Data Future Hinges on Variety
There are numerous factors that define and describe big data, but the community has settled on three primary elements: data volume, data velocity and data variety. This schema was originally put forth by analyst Doug Laney in a 2001 research note and has since become a de facto industry standard for defining big data. Over time, other analysts and pundits have expanded the framework to incorporate as many as seven Vs, and Laney himself quipped that he had come up with a definitive 12 Vs “to cover the full spectrum of big data challenges and opportunities.”
Of the three Vs, the first two, volume and velocity, get a lot of attention but are actually pretty straightforward. Challenging to be sure, but addressing them is essentially a matter of throwing enough processing power, networking capability and storage capacity at the problem. Hadoop has made this more tractable from a cost perspective by distributing computation across inexpensive commodity hardware. Variety, on the other hand, takes a lot more finesse. Variety refers to multiple data types and sources. Put simply, the more diverse the data, the bigger the headache.
In his original paper introducing the three Vs, Laney wrote that “the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics” constituted the principal barrier to effective data management.
While new technologies and frameworks have been developed to deal with volume and velocity, data variety remains resistant to software solutions. In a recent article in SmartDataCollective, Mahesh Kumar maintains that data variety is the unsolved problem in big data, mainly because it is so difficult to solve programmatically. To make sense of disparate data streams, the data must first be put into context, which requires human intervention and domain-specific expertise. This lack of automation is preventing the industry from realizing the full potential of big data.
“I think that the problem lies in data variety,” Kumar writes, “the sheer complexity of the multitude of data sources, good and bad data mixed together, multiple formats, multiple units and the list goes on.”
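Kumar’s point can be made concrete with a small sketch. The records, field names and exchange rate below are hypothetical, invented purely for illustration: three sources report the same kind of event with different schemas, date formats, units and currencies, and each branch of the normalizer encodes context that a human analyst had to supply, since none of it can be inferred from the raw data alone.

```python
from datetime import datetime, timezone

# Hypothetical records from three sources, each with its own schema and conventions.
records = [
    {"ts": "2014-03-01", "amount": "1,200", "currency": "USD"},  # US-style thousands separator
    {"timestamp": "01/03/2014", "amt_eur": 870.0},               # ambiguous day/month order, EUR
    {"date": 1393632000, "value": "1200.00"},                    # epoch seconds, unlabeled currency
]

def normalize(record):
    """Map each source's schema onto a common one.

    Every branch bakes in a human judgment call: which field holds the date,
    how it is formatted, and what unit or currency the number is in.
    """
    if "ts" in record:
        return {
            "date": datetime.strptime(record["ts"], "%Y-%m-%d").date(),
            "amount_usd": float(record["amount"].replace(",", "")),
        }
    if "timestamp" in record:
        # Assumes DD/MM/YYYY and a fixed, illustrative EUR-to-USD rate of 1.38.
        return {
            "date": datetime.strptime(record["timestamp"], "%d/%m/%Y").date(),
            "amount_usd": record["amt_eur"] * 1.38,
        }
    if "date" in record:
        return {
            "date": datetime.fromtimestamp(record["date"], tz=timezone.utc).date(),
            "amount_usd": float(record["value"]),  # assumes the value is already in USD
        }
    raise ValueError("unrecognized source schema")

normalized = [normalize(r) for r in records]
```

Three schemas already require three hand-written branches; real deployments face hundreds of sources, which is why the work falls to domain specialists rather than software.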
“As a result of this unsolved problem, we’re grooming a large field of specialists with proficiency in specific domains, such as marketing data, social media data, telco data, etc. And we’re paying those people well, because their skills are both valuable and relatively scarce,” he adds.
Dealing with the “high-variety” aspect of the big data challenge is expensive in both time and money, forcing businesses to rely on people and services. In support of this point, Kumar cites recent figures from Gartner, which show that in the big data realm, for every dollar spent on software, nine dollars are spent on services. Gartner estimates that services spending will surpass $40 billion by 2016.
The flip side of the variety challenge is the enormous opportunity for innovative solutions that can hide some of the complexity and cost. Solving these challenges will open up the advantages of big data to a greater number of users.
“The most important V is variety,” observed Quocirca research analyst Clive Longbottom in a 2012 IT Pro report. “If you cannot deal with a variety of streams coming through, you’re not doing Big Data.”