Corralling VM Complexity with Machine Learning
The widespread use of virtual machines (VMs) makes tracing performance problems in today’s enterprise applications a daunting task. Traditional tools designed to track slowdowns on physical hardware are no match for today’s fast-changing VM infrastructure. Now a software company called SIOS Technology is using machine learning algorithms and predictive techniques to cut through the VM complexity and give harried admins a jump-start on resolving performance issues.
The huge popularity of hypervisors like VMware’s ESX Server and Citrix’s XenServer has had a fundamental impact on how IT operations teams do their jobs. Instead of running one Windows or Linux application on one Intel-based server, organizations now stack a dozen or more operating environments on each physical server, thereby ramping up the workload efficiency of the servers to levels typically only seen with big hosts like IBM mainframes and Power Systems servers.
But that newfound computing density has had some unanticipated consequences. Because the hypervisors provide an abstraction layer atop the underlying compute, storage, and networking resources, it can be hard to know exactly which underlying resources the applications are using. That, in turn, can throw IT administrators for a loop when they try to track down performance problems.
“There’s just so many moving pieces,” says Jerry Melnick, COO of SIOS Technology, a longtime provider of high availability (HA) software for Intel-based servers. “When you move them onto a virtualized environment, the complexity goes up an order of magnitude, if not more, because now the pieces are interdependent on the hardware resources underneath them.”
When problems crop up, as they inevitably do, IT staffers must try to find the cause. What typically happens is the application person says his software is working fine, the SAN guy says the storage is working normally, and the network guy says the network is fine, Melnick says. “So it becomes a finger-pointing operation,” he says. “It’s not entirely clear who’s doing what when you have 30 servers and 2,000 virtual machines.”
SIOS has heard this problem over and over from its customers, and so it decided to do something about it. First it hired Sergey A. Razin, a Ph.D. who has applied his machine learning expertise at companies like Kaspersky and Avaya. As CTO, Razin led development of the SIOS iQ technology, which the company formally unveiled today.
According to Melnick, the software uses big data technology, including graph analytics and machine learning, to track how various VMs and applications consume resources like CPU, memory, storage, and network I/O, and ultimately to detect patterns that are indicative of applications gone bad.
“We’re looking to find correlations and patterns of behavior in the system that are predictive of abnormal behavior,” Melnick says. “Machine learning is about putting together all the pieces: first finding who’s associated with whom, who’s connected to whom, and then when something happens, try to identify the pattern of behavior and whether it’s abnormal.”
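The "who's connected to whom" step can be pictured with a small sketch. This is not SIOS iQ's actual method, which the company does not detail here; it is one simple way to group resources whose metrics move together, using Pearson correlation and connected components. The metric names, sample values, and the 0.9 cutoff are all made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def related_groups(series, min_corr=0.9):
    """Link metrics whose |correlation| >= min_corr, then return the
    connected components of that graph as sorted lists of names."""
    neighbours = {name: set() for name in series}
    for a, b in combinations(series, 2):
        if abs(pearson(series[a], series[b])) >= min_corr:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for name in series:
        if name in seen:
            continue
        # Depth-first walk to collect everything reachable from `name`.
        stack, group = [name], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical samples taken at the same five points in time.
metrics = {
    "vm-a cpu": [10, 20, 30, 40, 50],
    "vm-b cpu": [12, 22, 31, 42, 52],
    "datastore-1 latency": [5, 9, 15, 21, 26],
    "vm-c cpu": [30, 10, 40, 10, 30],
}
groups = related_groups(metrics)
```

In this toy data the two VMs and the datastore rise and fall together and end up in one group, while the third VM stands alone. A production system would presumably work from far longer histories and richer relationship data than a single correlation score.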
The software uses various unsupervised and semi-supervised machine learning algorithms to figure things out about each customer’s system. “Over time we can understand that these three VMs, storage, and network are connected and they always react under certain automatically identified thresholds,” Melnick says. “When we see a deviation over time for that group, we understand that’s an anomaly, that’s aberrant behavior, and that needs to be called out. Once we have that being called out, we can link that to events that occurred, and ID what happened and where you need to look.”
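To make the idea of "automatically identified thresholds" concrete, here is a minimal sketch. It assumes a much simpler technique than whatever SIOS iQ uses internally: learn a normal band from a group's past readings (mean plus or minus three standard deviations) and flag new readings that fall outside it. The function names, the sample numbers, and the choice of three standard deviations are illustrative assumptions.

```python
from statistics import mean, stdev


def learn_thresholds(history, k=3.0):
    """Learn a (low, high) band from readings taken during normal operation.

    The band is mean +/- k sample standard deviations; k=3.0 is an
    arbitrary illustrative choice.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def find_anomalies(samples, bounds):
    """Return the indexes of samples that fall outside the learned band."""
    low, high = bounds
    return [i for i, x in enumerate(samples) if x < low or x > high]


# Hypothetical combined CPU utilization (%) for one group of related VMs.
baseline = [41, 43, 40, 42, 44, 39, 42, 41, 43, 40]
bounds = learn_thresholds(baseline)

# New readings for the same group; the third one is far outside the band.
new_samples = [42, 41, 78, 40]
anomalies = find_anomalies(new_samples, bounds)
```

Once a reading is flagged, its position in the series gives a timestamp that can be matched against logged events, which is the linking step Melnick describes.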
The software bubbles up the alerts to SIOS iQ’s PERC dashboard, which tracks Performance, Efficiency, Reliability, and Capacity. You can think of those as indicator lights in a car, Melnick says. “They represent at a high level, across the infrastructure that we’re analyzing, an aggregate goodness and deviation of goodness over time.”
The software doesn’t actually help IT staffers fix the underlying issue; they still need to use the various performance detection tools and techniques they have traditionally used. But by pointing the customer in the right direction (say, by showing that a certain VM is suddenly consuming an inordinate amount of CPU under certain conditions), the software lets IT staffers get a jump on the problem resolution process. Melnick says the software will save customers at least a day per event.
The software currently supports only VMware’s ESX Server hypervisor, but the company plans to expand that. It also supports Microsoft’s SQL Server database, although the company wants to support additional databases and applications (such as ERP systems from SAP or Oracle) in the future.
SIOS isn’t the only vendor applying machine learning technology to complex IT environments. But it is at the leading edge of a trend that’s sure to expand, particularly as adoption of cloud computing rises and the VM footprint in the data center gets bigger. “People are just starting to use it,” Melnick says. “As they use data mining, big data, and machine learning approaches in other domains, it’s just coming into the IT domain.”