August 15, 2022

AI for DevOps: Cutting Cost, Reducing Risk, Speeding Delivery

(Ashalatha/Shutterstock)

Organizations collectively spend billions every month on DevOps processes, yet bad code still makes it into production, causing downtime, wasted time and money, and reputational harm. With so much at stake, DevOps would seem a natural fit for automation through AI and machine learning. There’s at least one company developing it, but it’s probably not a name you would guess.

You don’t have to look very far to find evidence of DevOps disasters. CRN has a list of the biggest cloud outages so far in 2022, which includes big names like Google Cloud, Apple, and IBM. And who can forget the big Slack outage that occurred in February?

The underlying causes of these outages differ. In some cases, it’s a network configuration error; in others, a database update gone bad. DNS errors remain commonplace, and fat fingers have yet to be banished from the IT kingdom.

But upon closer inspection, there is a common theme across many, if not most, of these stories: an erroneous change was moved to production when it shouldn’t have been (we’ll give Google some slack on the severed undersea cable that impacted its service in June, but we’re wondering why Microsoft didn’t sooner detect the power oscillations that caused fans to automatically shut down in an Azure data center).

None of this is easy. Modern software development is extremely complex, with a thousand moving parts that must be synchronized. The process of moving software from development to production, which touches both development and operations and is collectively termed DevOps, is rife with complexity and potential tripwires. The practice of letting tech professionals pick their own tools brings its own set of complications.

The average cost of application downtime is $5,600 per minute, according to Gartner (Gorodenkoff/Shutterstock)

The solution up to this point has been to throw manpower at the DevOps problem. Developers, testers, deployment managers, and site reliability engineers (SREs) spend many hours tracking updates and configuration changes in the hope that nothing gets by them. Some organizations have begun moving toward a standard set of tools to reduce complexity, but that hasn’t made much of a dent yet.

The folks at Digital.ai have a fundamentally different approach. Instead of relying on humans to spot problems or trying to force a standardized set of tools on the DevOps or CI/CD (continuous integration/continuous delivery) realm, Digital.ai uses machine learning techniques to predict the likelihood that a given piece of new code or a code update will cause problems.

According to Florian Schouten, the company’s vice president of product management, Digital.ai’s predictive solution starts by ingesting historical data from DevOps platforms and tools, such as Git, Jenkins, Azure DevOps, Jira, and ServiceNow. Digital.ai then feeds the data into classification algorithms, which detect patterns across those historical change events.

“Most organizations have 3,000 to 5,000 changes a month that will feed into the model,” Schouten says. “It will capture all the aspects of those, let’s say, 5,000 monthly change events, such as who the team is, what infrastructure was changed, what testing was done during the software development cycle, who the developer or developing team was, how many defects were found during testing, and all these other environmental factors that then can be correlated to the success and failure of past changes.”
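Schouten doesn’t spell out the model internals, but the workflow he describes, tabular change-event features plus a success/failure label, maps naturally onto an off-the-shelf classifier. Here is a minimal sketch in Python using scikit-learn; the feature names, sample data, and choice of gradient boosting are illustrative assumptions, not Digital.ai’s actual schema or algorithm.

```python
# Illustrative sketch only: feature names, data, and model choice are
# assumptions, not Digital.ai's actual schema or algorithm.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# In practice, rows like these would be exported from tools such as
# Git, Jenkins, Jira, or ServiceNow (thousands of change events a month).
history = pd.DataFrame({
    "team":            ["payments", "search", "payments", "infra", "search", "infra"],
    "change_type":     ["code", "config", "code", "network", "code", "config"],
    "tests_run":       [412, 0, 388, 12, 401, 3],
    "defects_found":   [3, 0, 11, 1, 2, 0],
    "files_touched":   [14, 2, 57, 1, 9, 1],
    "caused_incident": [0, 0, 1, 1, 0, 0],  # label: did the change fail in prod?
})

# One-hot encode categorical features; numeric features pass through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["team", "change_type"])],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("clf", GradientBoostingClassifier())])
model.fit(history.drop(columns="caused_incident"), history["caused_incident"])

# Score a pending change: the second predict_proba column is the estimated
# probability that this change causes an incident.
pending = pd.DataFrame([{
    "team": "payments", "change_type": "code",
    "tests_run": 390, "defects_found": 9, "files_touched": 61,
}])
risk = model.predict_proba(pending)[0, 1]
print(f"Estimated failure risk: {risk:.0%}")
```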

Once trained, Digital.ai’s algorithm can then be used to predict the likelihood that a current change will cause problems. The company’s offering can detect more than 80 risk factors out of the box, with a likelihood score generated for each one. The software development manager can use this to make decisions about the need for additional review before hitting the “go live” button.

“If it’s 1% [chance of causing a failure], OK, let it go. I’m not going to spend any time on it,” Schouten tells Datanami. “If it’s a 60% likely thing? I better take a look and route it to the right people for review.”
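That kind of thresholded triage is straightforward to express in code. The sketch below turns a predicted failure probability into a routing decision; the cut-offs are hypothetical, chosen to echo Schouten’s 1% and 60% examples rather than any product default.

```python
# Hypothetical routing policy; thresholds are illustrative, not product defaults.
def route_change(risk: float) -> str:
    """Turn a predicted failure probability into a routing decision."""
    if risk < 0.05:
        return "auto-approve"            # low risk: let it run through
    if risk < 0.50:
        return "standard review"         # moderate risk: normal approval flow
    return "escalate to expert review"   # high risk: route to the right people

for score in (0.01, 0.22, 0.60):
    print(f"{score:.0%} risk -> {route_change(score)}")
```

In practice, an organization would tune these thresholds against its own tolerance for outages versus the overhead of extra reviews.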

This approach can bring multiple benefits. For starters, the extra layer of scrutiny can help avoid an outage that could have a devastating impact on an organization. It can also save money by making better use of existing resources, “which on its own can pay for the solution and often be in the millions of dollars, given how many people tend to be involved and how many change events there are,” Schouten says.

Last but not least, Digital.ai’s solution–which requires about six to eight weeks of ML training–can also help speed up the delivery of new features in software products. “We hear organizations say they can release software five to 10 times faster with half the people involved,” Schouten says.

Speed is a recurring theme in today’s digital world. Most organizations are trying to deliver new software updates as quickly as their DevOps and CI/CD processes will allow (there’s a reason they call them “sprints” and not “crawls”). However, moving fast raises the possibility of making mistakes, which is where AI can pick up the slack by providing better insight into which changes carry the greater risk of errors.

Do you have a fast lane for code updates? (fotomak/Shutterstock)

“Most organizations are in some form of trying to establish, or have a desire to establish, what they call a software fast lane,” Schouten says. “They want to differentiate between small changes or very low risk changes that they should just let run through without oversight from some of those folks, whereas they want to then focus the expertise of these people on the areas where it’s most needed, thereby balancing the risk with going faster.”
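Wiring that fast lane into a delivery pipeline could be as simple as a gate step that queries a risk-scoring service before deployment. The sketch below is purely hypothetical: the endpoint, payload, and response field are invented for illustration and do not reflect Digital.ai’s actual API.

```python
# Hypothetical CI fast-lane gate; the endpoint and response schema are
# invented for illustration and do not reflect Digital.ai's actual API.
import sys
import requests

RISK_ENDPOINT = "https://risk-scoring.internal/api/v1/score"  # hypothetical

def gate(change_id: str, auto_approve_below: float = 0.05) -> int:
    """Return 0 (deploy) or 1 (hold), suitable as a CI step's exit code."""
    resp = requests.post(RISK_ENDPOINT, json={"change_id": change_id}, timeout=10)
    resp.raise_for_status()
    risk = resp.json()["failure_probability"]  # assumed response field
    if risk < auto_approve_below:
        print(f"{change_id}: {risk:.0%} risk, fast lane, deploying")
        return 0
    print(f"{change_id}: {risk:.0%} risk, held for human review")
    return 1

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))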

This is a good example of how AI will not replace humans so much as help the existing humans do their jobs better. The world of enterprise software development is way too complex to hand over to AI, but human expertise is limited. Considering how important software development is to the multi-trillion-dollar IT industry–in fact, to all industries these days–this kind of solution makes a lot of sense.

“Gartner has made pretty clear that risk analytics, which I’m not too shy to say we spearheaded–Numerify is a company that joined Digital.ai years ago–is…listed as a critical capability as part of what they call value stream management,” Schouten says.

What’s surprising is that more solutions of this type haven’t already been developed. Why haven’t the bigger players in the DevOps world delivered something like this? “They do have the data,” Schouten says. “To our knowledge they don’t yet [have this type of AI]. But I’m sure over time they will.”

Related Items:

AI Continues DevOps Expansion

AI-Enabled DevOps: Reimagining Enterprise Application Development

BI Startup Numerify Raises Another $15M

 
