The Third Age of Data
The Third Age of Data has arrived. Today, an estimated 1 trillion sensors are embedded in a nearly limitless landscape of networked sources, from health monitoring devices to municipal water supplies and everything in between. The massive amounts of data being generated hold the promise of ever-greater insight, but only for those who successfully ingest, process, and harness the flood of information. Now more than ever, scalability and real-time analytics are essential for companies that want to meet business demands and stay ahead of the curve.
To understand how we came to generate so much information, let’s rewind 20 to 30 years to the First Age of Data. Traditional IT infrastructure was designed around data that was predominantly created by humans, including emails, documents, business transactions, databases, records, and the like. This data was primarily transaction-oriented, consisting of back-office databases and once-a-day batch processing. For example, when a bank generated a customer statement, the statement was processed on a mainframe, stored in a traditional database, transferred across a storage area network, and eventually delivered to your mailbox. Many large, well-established companies made a name for themselves during this era of data.
In the Second Age of Data, information was still produced overwhelmingly by humans. This includes the vast digital trails left behind by each internet and smartphone user on the planet. The era is also characterized by a massive explosion in content such as office documents, streaming audio and video, digital imaging and photography, email, and websites. This growth in the amount of data being generated, together with the variety of file types, formats, and sizes, placed new demands on storage. During this time, many pioneering vendors from the previous transactional age were replaced by scale-out companies that could meet the new need for scale.
We’ve now entered the Third Age of Data, which refers to the enormous amounts of information being generated by machines, with a volume, variety, and velocity that the world has never seen before. Data is being generated by sensors, imaging, data capture, logging, monitoring, and more. Harnessing the Third Age of Data, although a challenge, provides a wealth of opportunities for enterprises to discover new insights from complex systems. However, that data is only of value to those who can efficiently understand and manage large-scale data storage and processing.
Here’s a breakdown of the changes and challenges we’re seeing with increasing volume, variety, and velocity of data.
Machine Data Volume
Embedded sensors in automobiles and roadways now supply information regarding the location, speed, direction, and operation of a given vehicle. Insights from this data provide everything from better traffic management to vehicle monitoring, routing, and even entertainment. Similarly, network packet, traffic, and call and log monitoring provides insight into service operations and security, keeping IT datacenters and telecommunications networks safe and sound. The scale of this sensor and packet data is already massive and growing every year.
Machine Data Variety
We’re also witnessing an unprecedented variety in types of data. In some industries, companies rely on constant, tiny measurements from ground or equipment-mounted sensors and devices in addition to huge and complex satellite imagery, weather models, geospatial data, etc. Many companies have a mix of data sources (different machine systems, humans, etc.), with corresponding differences in data size and type. In either case, systems and storage optimized for one end of the spectrum may not be able to readily handle the other, making scale-out systems essential.
Machine Data Velocity
Finally, machine data is being produced at an ever-growing velocity. Sensors, satellites, networked systems, and connected vehicles all have one thing in common: they never sleep. These machines typically operate on the basis of continual measurement (24x7x365), constantly streaming data that must be processed and stored. Moreover, the flood of data can spike quickly and sometimes unexpectedly. In life sciences, for example, large-scale systems or teams may rapidly generate tens of millions of files, or multi-terabyte models, in only a few hours. Keeping up with that data load and, more importantly, understanding its constant ebb and flow is an equally massive challenge.
“The Human Face of Big Data,” a PBS documentary, states that we now generate as much data every two days as the world generated from its inception through 2003. An oft-quoted industry truism holds that every day, we create 2.5 quintillion bytes of data and that 90 percent of the world’s data has been created in the last two years alone. The ever-rising tide of machine data will only further accelerate these numbers.
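To put the 2.5 quintillion bytes per day figure into more familiar units, the short calculation below converts it to exabytes per day and terabytes per second. It is only an illustrative sketch: it takes the quoted figure at face value and assumes decimal units (1 EB = 10^18 bytes, 1 TB = 10^12 bytes).

```python
# Illustrative unit conversion for the quoted "2.5 quintillion bytes per day".
# Assumption: decimal (SI) units, i.e. 1 EB = 10**18 bytes and 1 TB = 10**12 bytes.
BYTES_PER_DAY = 2.5e18          # 2.5 quintillion bytes, as quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

exabytes_per_day = BYTES_PER_DAY / 1e18
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e12

print(f"{exabytes_per_day:.1f} EB/day, about {terabytes_per_second:.1f} TB/s")
# prints: 2.5 EB/day, about 28.9 TB/s
```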
As with any major change, there will be winners and losers. When it comes to the Third Age of Data, enterprises need to scale out if they don’t want to become obsolete. IDC chief analyst Frank Gens states that scale “is the critical ingredient in the unfolding battle for digital success.”
In order to keep up with these new demands, companies should ask themselves the following questions: How are we going to manage this onslaught of data? Where is the data going to go? How are we going to process the raw data so that we can understand it and gain actionable insights? How do we feed said insights into the next generation of products and services being created?
So, what does the playing field look like in the Third Age of Data?
The Connected Car
Every car is becoming a data generator on wheels, accessing systems within the vehicle as well as the driver’s cell phone, and transmitting that data to other systems, which can range from the automaker (for monitoring the vehicle’s performance) to highway departments (for traffic monitoring). It’s relatively early days for the connected car, but before long, an array of smart networked applications in cars will turn “dumb” cars into dinosaurs. Strategy&, the strategy consulting team at PwC, predicts that 90% of vehicles will have built-in connectivity platforms by 2020. According to research firm Analysys Mason, the number of connected cars will grow to more than 150 million this year and more than 800 million by 2023.
General Electric (NYSE: GE) and Cisco Systems (NASDAQ: CSCO) have projected that by the end of this decade, at least 1 trillion sensors will be deployed as part of the Internet of Things (IoT), representing a $15 trillion market. Gartner estimates there will be 4.9 billion connected ‘things’ this year, soaring to 25 billion by 2020.
For example, the IT team at a major U.S. university’s independent health research center found itself with little visibility into its massive store of trending data on the global impact of over 400 different diseases in over 180 countries. Their compilation and analysis of these global health statistics was generating tens of millions of files in a single afternoon. Just keeping up with that massive data growth and understanding its ebb and flow was a great challenge.
By implementing a scale-out storage system with real-time analytics, the center gained both the scalability and the operational visibility it needed to store its data efficiently and conduct its life-saving research.
Another example is a top U.S. telecommunications provider that gathers the log data from all of its network endpoints around the world. This immense volume of log data is ingested into a centralized storage tier, where it is analyzed with a mix of tools, including Splunk (NASDAQ: SPLK), Hadoop, and some internally developed applications, to gain actionable insights into activity across its global network.
Or take Vaisala, which performs weather modeling and forecasting services for planning, assessment and deployment of renewable energy systems. Working with everything from in-the-field sensor data to the advanced weather and climate models of national and international weather services, the company helps clients project potential solar, hydro and wind power generation 10 minutes to 30 years into the future.
All of this creates the data processing and storage challenge of efficiently managing a vast volume of tiny sensor measurements, combined with huge and massively complex forecasting models, to generate meaningful assessments on the ground and above it. The size of Vaisala’s simulations ranges from a cube of space across a rectangle of land all the way up to the clouds, and then over decades of time.
It’s a mind-boggling data challenge, but one Vaisala can handle because it put in place a massively scalable system with the ability to ingest, store, and analyze raw data that is being continuously generated.
Why Use Scale-Out Storage?
While machine data may grow to be a flood, it usually starts as a seemingly manageable stream. IT teams often address that trickle by using network-attached storage (NAS) to create a central storage repository, ideally with a scale-out design that provides for expansion without the need for multiple namespaces.
But as that stream of machine data becomes a torrent, analyzing and managing the data becomes increasingly difficult. The problem becomes one of distinguishing hot from cold data, understanding capacity and performance impact by user or application, and precisely forecasting when capacity will run out. Running reports and analysis using traditional scale-out storage technologies can take days, weeks, or sometimes months. The result is an endless cycle of reporting that can itself degrade system performance, and assessments that are out of date before they’re even complete.
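Two of the questions above, which data is hot versus cold and when capacity will run out, can be illustrated with a small sketch. This is not how any particular storage product works. The function names, the 30-day hot window, and the sample figures are all hypothetical, and the forecast assumes roughly linear growth, which real workloads with spikes may not follow.

```python
from datetime import datetime, timedelta

def split_hot_cold(last_access, now, hot_window_days=30):
    """Partition paths into (hot, cold) lists by last-access time.

    Hypothetical rule: anything accessed within `hot_window_days` is hot.
    """
    cutoff = now - timedelta(days=hot_window_days)
    hot = sorted(p for p, t in last_access.items() if t >= cutoff)
    cold = sorted(p for p, t in last_access.items() if t < cutoff)
    return hot, cold

def days_until_full(daily_used_tb, capacity_tb):
    """Estimate days from the last sample until capacity is reached.

    Fits a least-squares line to one usage sample per day and extrapolates.
    Returns None if usage is flat or shrinking.
    """
    n = len(daily_used_tb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_tb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_tb))
    slope = sxy / sxx  # TB of growth per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_tb - intercept) / slope
    return max(0.0, day_full - (n - 1))

# Made-up sample data for illustration only.
now = datetime(2024, 1, 31)
accesses = {"/logs/today": datetime(2024, 1, 30), "/archive/old": datetime(2023, 6, 1)}
print(split_hot_cold(accesses, now))                    # (['/logs/today'], ['/archive/old'])
print(days_until_full([100, 102, 104, 106, 108], 120))  # 6.0
```

The point of the sketch is that both answers are cheap to compute once per-file access times and daily usage totals are readily available. The hard part at scale is gathering that metadata without walking billions of files, which is where the days-long reports described above come from.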
As the Third Age of Data begins, one can only imagine the stunning advances that lie ahead. Managing and deriving value from this data will require the right strategies, scale-out technologies and smart investments to bring this new age to fruition.
About the author: Jeff Cobb is Vice President of Product Management at storage provider Qumulo. Prior to Qumulo, Jeff was SVP of Strategy and a Distinguished Engineer at CA Technologies. As Chief Scientist at Wily Technology, the application performance management pioneer, Jeff was the architect of the flagship product Introscope. Jeff built a Java virtual machine at Connectix, and designed runtime architecture for the MacOS at Apple. Jeff holds an A.B. from Dartmouth College.