August 21, 2017

Cutting through the APM Complexity with Data

Bill Emmett

Today’s IT is disjointed and more complex than ever before. Application performance is a critical measure of customer experience, and is now defined by various elements including cloud infrastructure, software, APIs, microservices and network performance. But these variables reside outside the core application. Application performance management (APM) alone, while still valuable, is no longer enough to provide a 360-degree view. We need greater speed to support the complex and ever-changing application requirements of businesses. But at the same time, we need to make sure applications don’t break and remain secure.

The stakes are higher than ever — IHS reports information and communication technology (ICT) downtime is costing North American organizations $700 billion per year. What’s worse, repeated outages threaten business credibility and viability. Rapidly delivering new IT services requires real-time stakeholder engagement to provide the information they need to make decisions quickly and with confidence. With a data platform approach to application management, managers can share visibility to a broader set of stakeholders. Companies are embracing modern application architectures to operate with speed, and use a data-driven strategy that empowers all stakeholders.

APM alone is no longer enough

Business success is largely driven by offering new and different applications at web-scale ahead of the competition. To achieve this, it is critical to reduce lag time between business decisions and IT’s implementation. But achieving this is not easy. APM solutions must consider app complexity, emerging architectures and disjointed IT infrastructure, as they all bring new challenges to the table. When it comes to addressing these issues, APM-specific tools can only detect application-related problems with availability and performance. Unless the problem is directly related to application code, managers cannot identify the root cause.

In many cases, the problem resides in the infrastructure, not in the application’s logic. That’s why many consider configuration logs, automation tools and configuration-state changes “the smoking gun” for where problems may have been introduced. They then spend countless hours troubleshooting to find out what actually occurred. As an example, one software solutions provider for independent physician practices closely monitors the delivery of its services to approximately 15,000 users, who log into its systems daily. The firm’s IT staff must make sure applications are always available and the supporting infrastructure can meet demand. To do this, the company employs a top-down monitoring approach to rapidly search for data in the application stack. This enables them to effectively detect and remediate issues before they become real problems for customers.

As web-scale IT becomes a new mandate, new architectures and approaches are required to support it. Moreover, businesses have an obligation to deliver applications quickly and iterate often.

Finding answers in the data silos

Many new technologies, such as microservices, mobile and the “Internet of Things,” are infiltrating the application stack, making it nearly impossible to monitor successfully. Application and operations support teams need to know what’s under the hood of their apps. They also need to know how their applications interact with the underlying infrastructure, as well as gain insight into usage trends.

Unless application managers have a data platform that can collect, index, correlate and provide analytics across a broad array of management data sources, it’s difficult to spot problems quickly and isolate their source. Developers need insight into how their applications perform in production to build better performing, more reliable apps. When problems are app related, analyzing logs in production is critical. Lines of business are accountable as well, and need some insight on how application performance affects customer experience, revenue and costs.

Just as IT needs a management tool that addresses overall service availability and performance, application managers need a data platform that bridges the silos — incorporating and analyzing data from the various sources that influence performance and availability. To do this, the data platform must collect, index, store and analyze data to focus on event sequences or even individual data points.

This approach offers complete visibility into the apps — enabling you to monitor application performance, troubleshoot problems and analyze applications, resulting in improved future releases. Searches enable effective troubleshooting, as you quickly spot “the needle in the haystack.”

Data analytics provide the ability to connect the work that developer and operations teams do with outcomes. For example, using analytics, organizations can not only monitor and troubleshoot applications but also gain a broader view of whether the most recent software release was more or less reliable than its previous release. Furthermore, performance engineers and site reliability engineers can use the data to examine how new application releases affected resource consumption, and recommend changes either in infrastructure or in the application itself.
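As a rough illustration of the release comparison described above, the sketch below computes a server-error rate per release from request records. The record layout (a release label paired with an HTTP status code), the release names and the sample data are all assumptions made for the example; a real data platform would pull these fields from indexed logs.

```python
from collections import Counter


def error_rate_by_release(records):
    """Return the share of requests per release that ended in a 5xx status.

    `records` is a hypothetical list of (release, http_status) tuples.
    """
    totals = Counter()
    errors = Counter()
    for release, status in records:
        totals[release] += 1
        if status >= 500:  # treat 5xx responses as failures
            errors[release] += 1
    return {release: errors[release] / totals[release] for release in totals}


# Made-up sample data: four requests against each of two releases.
records = [
    ("1.4", 200), ("1.4", 500), ("1.4", 200), ("1.4", 200),
    ("1.5", 200), ("1.5", 200), ("1.5", 200), ("1.5", 200),
]
rates = error_rate_by_release(records)
print(rates)
```

With real traffic volumes, a comparison like this would also need enough requests per release for the difference to be meaningful.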

Correlation is key. Correlating events allows IT to move from spotting symptoms to establishing root cause by “connecting the dots.”
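One simple way to picture this kind of correlation is to pair each application symptom with the infrastructure changes that occurred shortly before it. The following is a minimal sketch under assumed inputs: the event tuples, the messages and the five-minute window are hypothetical, and time proximity only suggests a candidate cause rather than proving one.

```python
from datetime import datetime, timedelta


def correlate(app_errors, infra_changes, window=timedelta(minutes=5)):
    """Pair each application error with infrastructure changes that
    happened no more than `window` before it."""
    matches = []
    for err_time, err_msg in app_errors:
        for chg_time, chg_msg in infra_changes:
            # Keep only changes that precede the error within the window.
            if timedelta(0) <= err_time - chg_time <= window:
                matches.append((err_msg, chg_msg))
    return matches


# Made-up events from two separate data sources.
app_errors = [(datetime(2017, 8, 21, 10, 7), "checkout latency spike")]
infra_changes = [
    (datetime(2017, 8, 21, 10, 4), "load balancer config pushed"),
    (datetime(2017, 8, 21, 8, 0), "nightly backup finished"),
]
pairs = correlate(app_errors, infra_changes)
print(pairs)
```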

Diagnosing performance problems requires correlating conditions across a range of metrics

For example, application usage data and insights are useful to operations and developers, but also to other lines of business. Sometimes, problems in applications pertain to security. Having a common data platform and working from shared “evidence” helps accelerate investigations. While metrics, logs and data coming in from other tools are all valuable for monitoring, log files most often become the most authoritative source of data when performing detailed root-cause analysis and troubleshooting.

Sometimes, not even dashboards and well-designed searches are enough. Machine learning represents a great opportunity to sift through large amounts of data to find anomalies, trends and correlations that could not be anticipated. By applying machine learning to a population of data, the root cause of problems can be more readily surfaced. Machine learning also offers a far more sophisticated way to separate normal from abnormal performance trends. Combined, these capabilities allow people to focus on the real problems – whether they are in your infrastructure or application code – rather than spend time combing through data and dashboards.
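To make the idea of separating normal from abnormal behavior concrete, here is a deliberately simple statistical stand-in, not a machine learning model: it learns a baseline from the first few samples and flags later values that sit far from it. The function name, the baseline size, the threshold and the latency figures are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=5, threshold=3.0):
    """Return indexes of values more than `threshold` standard deviations
    away from the mean of the first `baseline_size` samples."""
    baseline = values[:baseline_size]
    mu, sigma = mean(baseline), stdev(baseline)
    anomalies = []
    for i, value in enumerate(values[baseline_size:], start=baseline_size):
        # Skip scoring when the baseline has no variation at all.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up response times in milliseconds; one sample is clearly out of line.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 103]
flagged = find_anomalies(latencies_ms)
print(flagged)
```

A production approach would typically use a moving baseline and account for seasonality, which is where learned models tend to outperform a fixed threshold like this one.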

Monitoring complex applications requires gathering data from many sources, and presenting it in a way that helps you better understand service-level performance and spot problems. However, moving from monitoring to troubleshooting, application managers need to be able to ask any question of their data. This requires a data platform with the ability to collect, index and retain raw data over an extended time. Inherent in the ability to “ask any question” is the ability to see across complex environments — networks, systems, containers and virtual machines, application tiers, APIs, microservices, databases, load balancers, cloud services, firewalls, power, HVAC and storage — to spot trends, problems and anomalies in applications. Application management must correlate events across time, users, data sources, location and transactions, and should include the ability to further explore application management data.

About the author: Bill Emmett is currently Director, IT Operations Product Marketing at Splunk. In this role, Bill focuses on go-to-market strategy and programs for infrastructure and application management use cases and engages with customers, industry analysts, technology and channel partners on a regular basis. Bill frequently speaks at IT industry events and is active in the IT management community. Bill has an MBA and bachelor’s degrees in Accounting and Computer Information Systems from Colorado State University and currently resides in Denver, Colorado. Bill has been with Splunk for three years. As a 20+ year veteran in the IT operations management market, Bill has held various marketing, IT, and R&D roles with HP Software and BMC Software. You can connect with Bill via Twitter at @billemmett000, through LinkedIn or on his blog at splunk.com.

Datanami