Can Big Data Tame the Chaos of Virtualized IT?
The “software-defined” revolution is driving private data centers toward AWS-like efficiency. However, the virtualization of hardware, storage, and networking, not to mention agile coding techniques and a rapid-fire “DevOps” culture, also makes it much more difficult for IT professionals to track down problems. Now some are turning to big data tech to help correlate the various logs and ultimately tame the rising IT chaos.
Visit your friendly neighborhood data center, and you’ll see rack upon rack of servers and storage arrays, all with blinking lights and cabling running everywhere. It’s quite an impressive thing to get a behind-the-scenes glimpse of the automation behind corporate America, to stand in the boiler room of our digitized society.
However, hidden under the impressive hardware is a growing management problem that’s exacerbated by the “software-defined” virtualization trend and near-constant iterations of applications through devotion to DevOps. A data center operations specialist may monitor CPU, memory, and storage utilization for an 8-node Linux cluster, while the administrator sees only the specs of the 20 Docker images he’s managing, and the programmer cares only about the JVM runtime that his Web apps run under.
A Broken Stack
The whole IT stack has gotten much more complicated over the past decade, and the traditional approach to systems management is broken, according to Bernd Harzog, a systems management industry veteran and the founder and CEO of Atlanta-based OpsDataStore.
“Thirty years ago, we would change software in production once a year because it always created problems,” he says. “Now it’s changed daily, hourly, even every minute. It’s written in more languages than it used to be. And of course everything’s been virtualized.”
The traditional IT management vendors—often dubbed the Big Four of IBM, HP, CA, and BMC—are powerless to keep their frameworks up to date against the pace of relentless change, he says. In fact, even 20 years ago, the pace of innovation was too high for them to keep up with. “They actually never got close to delivering an integrated solution,” Harzog says.
The result is that IT management today is a best-of-breed affair. The typical company runs about 20 to 30 different IT management tools to monitor various pieces of its stack, including the networks, the servers, the storage, the Web servers, the databases, and so on. Some companies run hundreds of different tools.
This best-of-breed approach results in every little thing being instrumented to the n-th degree. But unfortunately, none of the tools were designed to talk to each other, so a specialty monitoring tool may not have visibility into congestion in the network or a growing problem with the storage array.
Harzog developed OpsDataStore to be the glue that brings all of the best-of-breed products together. “Our proposal and our contention is everybody should specialize in what they do best,” he says. “We’re going to be the data plane to knit it all together.”
The company built its data plane using (you guessed it) open source technologies.
Apache Kafka serves as the message bus that funnels logs in from the various point vendors. Apache Cassandra serves as a highly scalable repository for storing the logs. And Apache Spark provides the compute power to analyze the data and track down problems. All these products are pre-built into the OpsDataStore product, which itself is a clustered application that requires nine separate nodes (they can run on virtual servers, naturally).
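The division of labor among those three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a `queue.Queue` stands in for the Kafka message bus, a dictionary stands in for the Cassandra repository, and a plain function stands in for a Spark job. The record fields, tool names, and the 500 ms threshold are hypothetical and are not taken from the OpsDataStore product.

```python
import queue
from collections import defaultdict

# Stand-in for the message bus (Kafka in the product described above).
bus = queue.Queue()

# Stand-in for the log repository (Cassandra in the product described above),
# grouped by the tool that emitted each record.
store = defaultdict(list)


def publish(source, timestamp, metric, value):
    """A monitoring tool pushes one record onto the bus."""
    bus.put({"source": source, "ts": timestamp, "metric": metric, "value": value})


def drain_bus_into_store():
    """Move every queued record into the repository; return how many moved."""
    moved = 0
    while not bus.empty():
        record = bus.get()
        store[record["source"]].append(record)
        moved += 1
    return moved


def slow_sources(metric, threshold):
    """Stand-in for an analysis job: sources whose average exceeds threshold."""
    flagged = []
    for source, records in store.items():
        values = [r["value"] for r in records if r["metric"] == metric]
        if values and sum(values) / len(values) > threshold:
            flagged.append(source)
    return sorted(flagged)


publish("apm-tool", 1, "latency_ms", 900)
publish("apm-tool", 2, "latency_ms", 700)
publish("storage-tool", 1, "latency_ms", 40)
drain_bus_into_store()
print(slow_sources("latency_ms", 500))  # prints ['apm-tool']
```

The point of the separation is that the tools only need to know how to publish, while storage and analysis can scale independently behind the bus.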
Harzog’s secret sauce lies in the object model used by the product, and the graph-like topology mapping layer.
The object model is key because it provides a standard data structure that all of the other vendors can output their logs in. A log management or application performance management (APM) vendor can partner with OpsDataStore and, without spending a dime, get access to the software development kit (SDK) that lets them output the logs in the right format.
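To make the idea of a shared output format concrete, here is a minimal Python sketch. The article does not describe the actual object model or SDK, so everything in it is assumed for illustration: the `CommonRecord` fields, the two imaginary vendor tools, and their raw field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonRecord:
    """Hypothetical shared record shape; not the real OpsDataStore model."""
    object_type: str   # e.g. "transaction", "vm", "host"
    object_id: str
    metric: str
    value: float
    timestamp: int     # seconds since the epoch


def from_apm_tool(raw):
    """Adapter for an imaginary APM tool that reports response time in seconds."""
    return CommonRecord(
        object_type="transaction",
        object_id=raw["txn_name"],
        metric="response_time_ms",
        value=raw["resp_secs"] * 1000.0,
        timestamp=raw["epoch"],
    )


def from_hypervisor_tool(raw):
    """Adapter for an imaginary hypervisor monitor reporting CPU as a 0-1 fraction."""
    return CommonRecord(
        object_type="vm",
        object_id=raw["vm"],
        metric="cpu_percent",
        value=raw["cpu_fraction"] * 100.0,
        timestamp=raw["time"],
    )


records = [
    from_apm_tool({"txn_name": "checkout", "resp_secs": 1.5, "epoch": 1000}),
    from_hypervisor_tool({"vm": "vm-17", "cpu_fraction": 0.5, "time": 1000}),
]
for record in records:
    print(record.object_type, record.object_id, record.metric, record.value)
```

Each vendor writes one small adapter, and everything downstream can treat records from different tools uniformly.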
OpsDataStore has partnerships in place with VMware, Intel, AppDynamics, ExtraHop, and Dynatrace, which have all committed to output logs using that data model. Harzog, who has done consulting work with all of them, says the vendors are eager to work with OpsDataStore.
The other critical piece is the topology mapping layer, which is patent-pending. The topology mapping layer essentially allows a user to pull up all the relevant logs for a particular problem, such as slow transaction processing, without knowing in advance exactly what to look for.
That’s different from how other log management tools work. To pull up the data in a competing tool, Harzog says, the user must include in his query the names of the transaction, the application, the virtual server, the physical server, and the database, along with the time range.
“With OpsDataStore you don’t have to know ahead of time what relates to what,” he says. “With OpsDataStore you simply say, ‘Here’s the transaction. Give me the things–the logs or the metrics–that relate to this transaction.’ So that relationship structure is basically reflected in our data model and it’s related in our Kafka topics.”
It’s essentially a graph database, which Harzog says is now necessary to elevate administrators above the growing complexity inherent in a best-of-breed approach to IT management. “It’s never been possible to analyze the behavior of the transactions in the context of their supporting infrastructure, because no one’s ever had a graph to relate it to,” he says.
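The kind of lookup Harzog describes, starting from a transaction and letting the graph supply everything related to it, can be sketched as a breadth-first traversal. The Python below is an assumption-laden illustration, not the patent-pending topology layer: the object names, the edges, and the sample records are all hypothetical.

```python
from collections import deque

# Hypothetical topology: each object points at the objects it runs on or uses.
topology = {
    "txn:checkout": ["app:storefront"],
    "app:storefront": ["vm:vm-17", "db:orders"],
    "vm:vm-17": ["host:esx-03"],
    "db:orders": ["vm:vm-22"],
    "vm:vm-22": ["host:esx-03"],
    "host:esx-03": [],
    "vm:vm-40": ["host:esx-09"],   # unrelated to the checkout transaction
    "host:esx-09": [],
}

# Hypothetical records, keyed by the object they describe.
records_by_object = {
    "txn:checkout": ["response_time_ms=1500"],
    "vm:vm-17": ["cpu_percent=50"],
    "host:esx-03": ["disk_latency_ms=85"],
    "vm:vm-40": ["cpu_percent=12"],
}


def related_objects(start):
    """Breadth-first walk from one object to everything that supports it."""
    seen = [start]
    pending = deque([start])
    while pending:
        current = pending.popleft()
        for neighbor in topology.get(current, []):
            if neighbor not in seen:
                seen.append(neighbor)
                pending.append(neighbor)
    return seen


def records_for(start):
    """Collect records for the start object and all related objects."""
    return {
        name: records_by_object[name]
        for name in related_objects(start)
        if name in records_by_object
    }


print(records_for("txn:checkout"))
```

The caller names only the transaction. The records for `vm:vm-17` and `host:esx-03` come back because the graph links them to it, while the unrelated `vm:vm-40` is left out.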
Results of queries against the graph can be visualized in a dashboard provided by OpsDataStore, or output via REST API to BI tools like Excel, Qlik, or Tableau, where customers can further manipulate the data.
OpsDataStore is still ramping up. The company doesn’t yet support cloud-based applications, Microsoft Hyper-V, or APM provider New Relic. Those are all on the roadmap, Harzog says. Hadoop clusters are also difficult to monitor through APM tools, but can be made visible to OpsDataStore via Codahale.
Since the single-stack approach to IT management is broken, the best-of-breed approach will be the standard for the foreseeable future. That means OpsDataStore will rely heavily on partners to use the SDK to build the connectors and output log data in the company’s object model format.
“Our contention is it’s actually not possible for a single vendor to deliver a fully functional integrated solution,” he says. “No single vendor can try to cover the waterfront here.”
Harzog says that several years ago, Splunk was headed down a path similar to the one OpsDataStore is now taking, and was aggressively partnering with vendors to get them to work with the log management vendor. However, the company changed tactics and started competing with its partners, which Harzog says was a mistake.
OpsDataStore won’t compete with its partners. “When I tell them I’m going to build a business that adds value to them and doesn’t compete with them, they believe me,” he says. “We have the standing in the industry and the reputation in the industry to pull this off.”
Pulling it off won’t be easy, but it will be interesting to see if they can. After all, somebody had to try.