Beyond the 3 Vs: Where Is Big Data Now?
Once defined by the “three Vs” of volume, velocity, and variety, the term “big data” has overflowed the small buckets we gave it and taken on a life of its own. Today, big data refers not only to the ongoing information explosion, but also to an entire ecosystem of new technologies and a distinct way of thinking about data itself. Here’s a quick recap of the short history of big data.
The World Wide Web was just starting to spread its wings back in 2001 when Gartner analyst Doug Laney came up with the three Vs definition. Even at that time, when there were just 361 million users on the Internet, it was clear that the geometric growth rate of data across the world had huge implications, not just for the servers, storage arrays, and networks that had to grapple with all those bits and bytes, but for society as a whole.
In 2003, 5 exabytes of data were generated across the entire world, more than 90 percent of it originating digitally. By 2004, the number of people on the Internet had more than doubled, as new social media sites like Facebook provided new ways for people to interact online.
As the Web expanded, the existing systems used to power applications, largely based on E.F. Codd’s relational database technology from the 1970s, began to show signs of stress. In 2004, Google researchers Jeffrey Dean and Sanjay Ghemawat published the seminal paper “MapReduce: Simplified Data Processing on Large Clusters.”
In 2005, Doug Cutting and Mike Cafarella created Hadoop to support the Nutch search engine project, and Yahoo soon backed the effort. The combination of MapReduce and the Hadoop Distributed File System (HDFS) provided Yahoo an economical way to lash together large numbers of commodity Intel servers to run large batch jobs, such as indexing the Internet. The novel approach to computing was soon copied by most of the large Internet firms in the Valley.
More than 1 billion people were on the Internet by 2006, when the world generated 161 exabytes of data. IDC forecast that the world’s data would double every 18 months through 2010, coincidentally the same rate of growth that Intel co-founder Gordon Moore ascribed to processing power. It turns out IDC underestimated that growth rate. Perhaps it didn’t factor Apple’s 2007 introduction of the iPhone into the equation, or foresee the course to hyperscale cloud computing that Amazon set us on when it launched Amazon Web Services (AWS) in 2006.
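To get a feel for what that forecast implies, here is a minimal back-of-the-envelope sketch. It assumes smooth compounding from the 161-exabyte figure for 2006, with one doubling every 18 months over the four years to 2010. The function name `projected_volume` is purely illustrative, and the result is the output of this simple model, not a number published by IDC.

```python
def projected_volume(start_exabytes, years, doubling_months=18.0):
    """Project data volume under a fixed doubling period (illustrative model)."""
    # Number of doublings that fit into the elapsed time.
    doublings = years * 12.0 / doubling_months
    return start_exabytes * 2.0 ** doublings


# 161 EB in 2006, doubling every 18 months for 4 years (through 2010).
projection_2010 = projected_volume(161.0, years=4)
print(f"Implied 2010 volume: {projection_2010:.0f} EB")  # about 1022 EB
```

Under these assumptions the forecast works out to roughly one zettabyte per year by 2010, a bit more than six times the 2006 figure.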
In 2008, as Hadoop was starting to spread organically, a group of engineers created a company named Cloudera to productize the open source software. Cutting joined the company as chief architect a year later, as another company by the name of MapR Technologies was founded. Yahoo would spin out a third Hadoop distributor, Hortonworks, in 2011.
This period, roughly from 2006 to 2012, was a remarkable period of technical innovation, as it set the stage for the incredible rise in distributed computing that we’re now enjoying. Many big data projects began during this time, most of which continue today as open source projects at the Apache Software Foundation. Some of the notable projects include:
- BigTable – Google described this NoSQL data store in an influential 2006 paper
- HBase – NoSQL-like data store for Hadoop modeled after BigTable, created by Powerset (now Microsoft) in 2007
- Cassandra – Facebook originally developed this offshoot of BigTable in 2008
- Hive – Created by Facebook to provide a SQL-like store in Hadoop
- Voldemort – LinkedIn created this distributed key value store in 2009
- Spark – An in-memory compute framework created at UC Berkeley’s AMPlab in 2009, and released as open source in 2010
- Tajo – A SQL query engine for Hadoop created in South Korea in 2010
- Kafka – Data ingest framework originally developed by LinkedIn and subsequently open sourced in early 2011
- DynamoDB – Amazon unveiled this hosted NoSQL database service in 2012
- Storm – Stream computing framework originally developed by Twitter, and released in 2011
- Impala – SQL query engine created by Cloudera in 2012
As Hadoop was taking flight, a parallel movement was also gaining steam around so-called NoSQL databases. Traditional relational databases weren’t good at handling the large amounts of increasingly unstructured data (pictures, videos, sound, etc.) that made up the Web. Schema-less NoSQL databases allowed people to trade away some of the consistency that traditional fixed-schema SQL databases offered in exchange for better scalability and a better ability to handle unstructured data.
As commercial open source NoSQL database vendors like Couchbase, DataStax, MarkLogic, MongoDB, and Neo4j gained traction, their old-school SQL cousins started getting hip to the new paradigm. In response, we started seeing IBM, Microsoft, and Oracle adding new features (such as in-memory capabilities, column- and graph-oriented structures, and support for new data types like JSON) in an attempt to shore up their relational databases (and maintenance revenue streams) for the new big data requirements of the day.
At the same time, a new class of scale-out SQL database vendors, such as VoltDB, NuoDB, and MemSQL, started creating new products based on good old relational technologies. These NewSQL database backers contend that you don’t have to give up relational database constructs, like strong consistency, just to get scalability.
By 2010, more than 2 billion people were on the Web, and the big data craze was in full swing. Fortune 100 companies started seeing value in Hadoop for making sense of messy, unstructured data. The possibility of harnessing machine learning algorithms to do predictive analytics on data in Hadoop began to emerge. Hadoop use cases expanded: Netflix used Hadoop to recommend movies; banks used it to gauge portfolio risk and detect fraudulent transactions; phone companies used it to forecast customer churn; marketers used it to serve targeted ads; retailers used it to drive promotions; and power companies used it to predict grid failures. The big Hadoop clusters in Silicon Valley got even bigger: Yahoo’s cluster topped 40,000 nodes, while Facebook’s Hadoop cluster held more than 30 petabytes of data.
Big data really hit the big time in 2012 (or thereabouts), as companies in all types of industries began hearing about this thing called Hadoop and the Harvard Business Review famously called data scientist “the sexiest job of the 21st century.” By this point, it was becoming evident to CEOs that making sense of all the data humans were generating was not only feasible, but might soon be necessary from a competitive viewpoint. That data came not only from comments posted on social media and location data traced by smartphones, but also from hiring, fraud detection, education, sports, government, and medicine. The big data land grab was on.
In 2013, another unexpected thing happened: the Internet of Things became, well, a thing. If you thought that humans could generate a lot of data, we were told, then wait until you see how much data a machine can generate. For example, on a single flight, the engines on a Boeing 787 could generate half a terabyte of data, which airlines would use to do predictive maintenance. With 26 billion devices expected on the IoT by 2020, the data tsunami would only multiply, Gartner told us.
By 2014, more than 3 billion people were on the Web, 1.4 billion people had smartphones, and the world was generating about 2.5 exabytes of data per day. The race to become a data-driven enterprise pushed the Hadoop ecosystem into full swing, and venture capitalists were throwing billions of dollars at the big data industry, which was expected to be worth $50 billion by 2020. Cloudera said it was worth $2 billion, while Hortonworks was the first Hadoop distributor to have an IPO.
We’re fifteen years into the new millennium, and we really are seeing a tidal shift in how people view and value data. Data is no longer a side effect of running a business; for many companies, data is the business. The total store of human knowledge is estimated to be about 8 zettabytes, a figure that’s expected to grow nearly four-fold by 2020. That huge number demonstrates the remarkable progress we have made describing our world, but there’s no end to the amount of data we can generate. Thanks to new technologies and computer architectures, at least we have a shot at harnessing the onslaught and using it.