V is for Big Data Virtualization
You’ve undoubtedly heard the (tired) definition of big data involving the three “Vs” for volume, variety, and velocity. Now there’s a growing movement to associate a fourth V-word with big data sets: virtualization.
For years, virtualization has been a buzzword in the data center, where virtualization software, such as a hypervisor, is used to carve up a single physical server into multiple virtual machines, each running its own OS image. Similar approaches are used to provide more fine-grained manageability of storage and network resources.
But virtualized data? That’s a new one! How in the world can one create a “virtual” copy of data? And how would a virtual version of data look or behave any differently from the original?
According to the folks at the data virtualization software company Denodo, data virtualization involves separating the physical source of data from its potential applications, with the goal of increasing business agility. It’s a twist on the role of master data management (MDM), which is a must whenever one works with datasets of a certain size or diversity.
Instead of physically grouping and storing datasets together — as in one big giant database or data warehouse stored on a big wonking server or cluster — virtualized data is stored in a distributed manner, across a series of disparate servers.
A logical, unified, and federated view of the data is then created (using the data’s metadata), providing streamlined access (via Web services or straight SQL) to data objects by business intelligence tools, dashboards, portals, and other data-consuming applications. It’s another take on the MDM approach that is gaining attention as data volumes proliferate with the Hadoop-driven “everything plus the kitchen sink” approach to data storage and processing.
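To make the idea of a federated view concrete, here is a minimal sketch in Python. It is an illustration only, not Denodo's product or API: two attached SQLite databases stand in for physically separate sources, and the names crm, billing, customers, invoices, and customer_revenue are all hypothetical. A single view joins the two sources, so a consuming application issues plain SQL without knowing where each table lives.

```python
import sqlite3

# Two separate in-memory databases stand in for physically distinct
# data sources (all names here are hypothetical, for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)

# The "virtual" layer: a view that joins both sources without copying
# any rows. SQLite only lets TEMP views span attached databases.
conn.execute(
    """
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.name AS name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.id, c.name
    """
)


def revenue_by_customer(connection):
    """Query the unified view as a BI tool or dashboard might."""
    cursor = connection.execute(
        "SELECT name, total FROM customer_revenue ORDER BY name"
    )
    return cursor.fetchall()


rows = revenue_by_customer(conn)
print(rows)
```

The consumer only ever sees customer_revenue. If the invoices table later moved to a different source, only the view definition would need to change, which is the decoupling of physical source from application that the article describes.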
Denodo says its data virtualization software can eliminate the need to first consolidate data before doing anything with it, which can save a lot of time and effort. According to The Data Warehousing Institute (TDWI), it takes an organization nearly eight weeks to add a new data source to its data warehouse.
Some of the pros and cons of data virtualization were discussed in a recent “BI Leadership” report written by TechTarget on behalf of Denodo. The report, titled “Data Virtualization: Perceptions and Market Trends,” notes that data virtualization software has been around for more than two decades, but that it suffered from performance and scalability problems. Thanks to today’s powerful servers, performance and scalability are no longer major issues, writes Wayne Eckerson, Director of Research for the Business Applications and Architecture Media Group at TechTarget, and the author of the report.
Finding the time and political will to implement a “data service” is, however, a possible roadblock to implementing data virtualization software. Eckerson notes that setting up a data virtualization environment requires careful upfront modeling of data. “This requires businesspeople to come to consensus on the meaning and definition of key data elements, and also requires technical people to corral these definitions into a linked model of the organization that makes sense to business users,” he writes.
As data volumes (the first “V” in big data) continue to grow, the benefits of breaking the one-to-one relationship between data and its place on the platter will outweigh the downsides, Eckerson writes. “Most companies have given up the notion that they can populate all their data into a data warehouse,” he says. “The value of creating a logical view of distributing data is gaining mindshare. At the same time, advances in hardware eliminate performance and scalability concerns.”
Despite the advantages, data virtualization is still in the early adoption phase. Some vendors embed data virtualization in their wares, with business intelligence and ETL vendors topping the list, but outright adoption of the technology is still rare among end-user organizations.
“Although this embedded approach enhances these applications with query federation capabilities, it doesn’t provide a universal data access layer that glues together data throughout an organization,” Eckerson writes.