August 15, 2022

AI for DevOps: Cutting Cost, Reducing Risk, Speeding Delivery

(Ashalatha/Shutterstock)

Organizations collectively spend billions every month on DevOps processes, yet bad code still makes it into production, causing downtime, wasted time and money, and reputational harm. With so much at stake, DevOps would seem a natural fit for automation through AI and machine learning. There’s at least one company developing it, but it’s probably not a name you would guess.

You don’t have to look very far to find evidence of DevOps disasters. CRN has a list of the biggest cloud outages so far in 2022, which includes big names like Google Cloud, Apple, and IBM. And who can forget the big Slack outage that occurred in February?

The underlying causes of these outages differ. In some cases, it’s a network configuration error; in others, a database update gone bad. DNS errors remain commonplace, and fat fingers have yet to be banished from the IT kingdom.

But upon closer inspection, there is a common theme across many, if not most, of these stories: an erroneous change was moved to production when it shouldn’t have been (we’ll give Google some slack on the severed undersea cable that impacted its service in June, but we’re wondering why Microsoft didn’t sooner detect the power oscillations that caused fans to automatically shut down in an Azure data center).

None of this is easy. Modern software development is extremely complex, with a thousand moving parts that must be synchronized. The process of moving software from development to production, which touches both development and operations and is collectively termed DevOps, is rife with complexity and potential tripwires. The practice of letting tech professionals pick their own tools brings its own set of complications.

The average cost of application downtime is $5,600 per minute, according to Gartner (Gorodenkoff/Shutterstock)

The solution up to this point has been to throw manpower at the DevOps problem. Developers, testers, deployment managers, and site reliability engineers (SREs) spend many hours tracking updates and configuration changes in the hope that nothing gets by them. Some organizations have begun moving toward a standard set of tools to reduce complexity, but that hasn’t made much of a dent yet.

The folks at Digital.ai have a fundamentally different approach. Instead of relying on humans to spot problems or trying to force a standardized set of tools on the DevOps or CI/CD (continuous integration/continuous delivery) realm, Digital.ai uses machine learning techniques to predict the likelihood that a given piece of new code or a code update will cause problems.

According to Florian Schouten, the company’s vice president of product management, Digital.ai’s predictive solution starts by ingesting historical data from DevOps platforms and tools, such as Git, Jenkins, Azure DevOps, Jira, and ServiceNow. Digital.ai then feeds the data into classification algorithms, which detect patterns across those historical change events.

“Most organizations have 3,000 to 5,000 changes a month that will feed into the model,” Schouten says. “It will capture all the aspects of those, let’s say, 5,000 monthly change events, such as who the team is, what infrastructure was changed, what testing was done during the software development cycle, who the developer or developing team was, how many defects were found during testing, and all these other environmental factors that then can be correlated to the success and failure of past changes.”
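Schouten doesn’t spell out the model internals, but the workflow he describes, tabular change-event features plus a success/failure label, maps naturally onto an off-the-shelf classifier. Here is a minimal sketch in Python using scikit-learn; the feature names, sample data, and choice of gradient boosting are illustrative assumptions, not Digital.ai’s actual schema or algorithm.

```python
# Illustrative sketch only: feature names, data, and model choice are
# assumptions, not Digital.ai's actual schema or algorithm.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# In practice, rows like these would be exported from tools such as
# Git, Jenkins, Jira, or ServiceNow (thousands of change events a month).
history = pd.DataFrame({
    "team":            ["payments", "search", "payments", "infra", "search", "infra"],
    "change_type":     ["code", "config", "code", "network", "code", "config"],
    "tests_run":       [412, 0, 388, 12, 401, 3],
    "defects_found":   [3, 0, 11, 1, 2, 0],
    "files_touched":   [14, 2, 57, 1, 9, 1],
    "caused_incident": [0, 0, 1, 1, 0, 0],  # label: did the change fail in prod?
})

# One-hot encode categorical features; numeric features pass through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["team", "change_type"])],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("clf", GradientBoostingClassifier())])
model.fit(history.drop(columns="caused_incident"), history["caused_incident"])

# Score a pending change: the second predict_proba column is the estimated
# probability that this change causes an incident.
pending = pd.DataFrame([{
    "team": "payments", "change_type": "code",
    "tests_run": 390, "defects_found": 9, "files_touched": 61,
}])
risk = model.predict_proba(pending)[0, 1]
print(f"Estimated failure risk: {risk:.0%}")
```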

Once trained, Digital.ai’s algorithm can then be used to predict the likelihood that a current change will cause problems. The company’s offering can detect more than 80 risk factors out of the box, with a likelihood score generated for each one. The software development manager can use this to make decisions about the need for additional review before hitting the “go live” button.

“If it’s 1% [chance of causing a failure], OK, let it go. I’m not going to spend any time on it,” Schouten tells Datanami. “If it’s a 60% likely thing? I better take a look and route it to the right people for review.”
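That kind of thresholded triage is straightforward to express in code. The sketch below turns a predicted failure probability into a routing decision; the cut-offs are hypothetical, chosen to echo Schouten’s 1% and 60% examples rather than any product default.

```python
# Hypothetical routing policy; thresholds are illustrative, not product defaults.
def route_change(risk: float) -> str:
    """Turn a predicted failure probability into a routing decision."""
    if risk < 0.05:
        return "auto-approve"            # low risk: let it run through
    if risk < 0.50:
        return "standard review"         # moderate risk: normal approval flow
    return "escalate to expert review"   # high risk: route to the right people

for score in (0.01, 0.22, 0.60):
    print(f"{score:.0%} risk -> {route_change(score)}")
```

In practice, an organization would tune these thresholds against its own tolerance for outages versus the overhead of extra reviews.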

This approach can bring multiple benefits. For starters, the extra layer of scrutiny can help avoid an outage that could have a devastating impact on an organization. It can also save money by making better use of existing resources, “which on its own can pay for the solution and often be in the millions of dollars, given how many people tend to be involved and how many change events there are,” Schouten says.

Last but not least, Digital.ai’s solution–which requires about six to eight weeks of ML training–can also help speed up the delivery of new features in software products. “We hear organizations say they can release software five to 10 times faster with half the people involved,” Schouten says.

Speed is a recurring theme in today’s digital world. Most organizations are trying to deliver new software updates as quickly as their DevOps and CI/CD processes will allow (there’s a reason they call them “sprints” and not “crawls”). However, moving fast raises the possibility of making mistakes, which is where AI can pick up the slack by providing better insight into which changes carry the greater risk of errors.

Do you have a fast lane for code updates? (fotomak/Shutterstock)

“Most organizations are in some form of trying to establish, or have a desire to establish, what they call a software fast lane,” Schouten says. “They want to differentiate between small changes or very low risk changes that they should just let run through without oversight from some of those folks, whereas they want to then focus the expertise of these people on the areas where it’s most needed, thereby balancing the risk with going faster.”
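Wiring that fast lane into a delivery pipeline could be as simple as a gate step that queries a risk-scoring service before deployment. The sketch below is purely hypothetical: the endpoint, payload, and response field are invented for illustration and do not reflect Digital.ai’s actual API.

```python
# Hypothetical CI fast-lane gate; the endpoint and response schema are
# invented for illustration and do not reflect Digital.ai's actual API.
import sys
import requests

RISK_ENDPOINT = "https://risk-scoring.internal/api/v1/score"  # hypothetical

def gate(change_id: str, auto_approve_below: float = 0.05) -> int:
    """Return 0 (deploy) or 1 (hold), suitable as a CI step's exit code."""
    resp = requests.post(RISK_ENDPOINT, json={"change_id": change_id}, timeout=10)
    resp.raise_for_status()
    risk = resp.json()["failure_probability"]  # assumed response field
    if risk < auto_approve_below:
        print(f"{change_id}: {risk:.0%} risk, fast lane, deploying")
        return 0
    print(f"{change_id}: {risk:.0%} risk, held for human review")
    return 1

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))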

This is a good example of how AI will not replace humans so much as help the existing humans do their jobs better. The world of enterprise software development is way too complex to hand over to AI, but human expertise is limited. Considering how important software development is to the multi-trillion-dollar IT industry–in fact, to all industries these days–this kind of solution makes a lot of sense.

“Gartner has made pretty clear that risk analytics, which I’m not too shy to say we spearheaded–Numerify is a company that joined Digital.ai years ago–is…listed as a critical capability as part of what they call value stream management,” Schouten says.

What’s surprising is that more solutions of this type haven’t already been developed. Why haven’t the bigger players in the DevOps world delivered something like this? “They do have the data,” Schouten says. “To our knowledge they don’t yet [have this type of AI]. But I’m sure over time they will.”

Related Items:

AI Continues DevOps Expansion

AI-Enabled DevOps: Reimagining Enterprise Application Development

BI Startup Numerify Raises Another $15M

 
