How DevOps Can Use Operational Data Science to See into the Cloud
As system architecture moves to the cloud, understanding the impacts of new code releases and points of failure requires a whole new approach. The increase in agility, bandwidth, and security has been matched by an increase in complexity. DevOps strategies need intelligence from real-time operational metrics to keep a product in optimal condition. Effective system maintenance requires visibility into a maze of interdependent operations. The need to direct DevOps efforts with dynamic operational intelligence has led to the emergence of a hybrid field: Operational Data Science.
Ask any CTO or CIO about DevOps and data science, and they’ll say that smart enterprises are investing in expertise for both skill sets. The DevOps approach has made IT more responsive to business needs, helping departments improve faster and more collaboratively. Big data just keeps getting bigger, with specialists in data science popping up in more and more industries as leaders realize that understanding their market, customers, and product requires crunching a lot of numbers.
Until recently, these areas have not seen much overlap. Data scientists are typically hired to look at business intelligence, not operational metrics. Operational Data Science (ODS) is meeting the challenge of the increased complexity of operations and the need for deeper and faster visibility.
Implementing ODS means analyzing all available performance metrics, in real-time, to determine the health of the system and identify problems before they become disasters. The payoff is total visibility into system processes, directing DevOps straight to the problem, reducing downtime, and optimizing performance. The trick is finding the right people and tools to get the job done.
A concrete example of the need for ODS is the inadequacy of log monitoring. The traditional method of translating metrics into logs—and then searching those logs—can’t keep pace with the many ways a cloud-based infrastructure can fail, and is prohibitively expensive/slow at enterprise scale. By streaming metrics from every corner of the cloud instead, and correlating for connections, a DevOps team can investigate issues in real time without filtering their data or taking on unnecessary latency in their time-to-recovery.
The ability to investigate billions of points of interrelated data is not trivial, both in terms of computing power and technical expertise. While ODS can provide invaluable visibility to DevOps and the products they support, building those capabilities takes training in data science. There are three challenges for a DevOps team preparing to work on ODS:
- Scaling data collection and analytics tools
- Building knowledge of statistical methods
- Condensing many metrics streams into actionable insight
Scaling Data Collection and Analytics Tools
The first road-block is scalability— finding a platform that can process a million metrics per second and perform analyses on the spot.
Graphite is an open source tool that collects time-series data, stores it, and displays it. It has all the bases covered, but it doesn’t get rave reviews. It buckles if you send too much data, stores data inefficiently, and eats up server space. Installing and maintaining Graphite is also expensive, both in terms of hardware and human hours. However, the customization can be a perk and Graphite is a common tool for operations analytics. Commercial solutions also exist that can handle larger data intake without a cumbersome integration process. Graphite also lacks alerting capabilities, which most commercial solutions have built in.
One feature that should not be compromised on an analytics platform is centralization. Consolidating data in one place allows for the discovery of unexpected connections. If the network team is looking at a network monitoring platform and the engineers at an APM platform, the DevOps effort is hamstrung by partial information. Metrics monitoring should enable collaboration; otherwise, the efficiency payoff will be diminished.
Building Knowledge of Statistical Methods
Many DevOps workers don’t have a strong background in statistics, so it’s a challenge to provide custom, self-serve analytics at an accessible level.
One way to overcome a lack of expertise is to shift some of the heavier lifting of data processing to an automated system. Some compromise must be made to maintain the customization capabilities of metrics analytics, but machine learning libraries in Python or R provide relatively accessible ways to do complex analysis. The time it takes to bridge the skill gap is negligible compared to the time savings of efficient detection and resolution of infrastructure failures. In the future, more of these capabilities will be available within analytics platforms.
Even the most advanced statistician will have trouble defining “normal” in a summary of metrics data. Even the most advanced platforms can’t automatically differentiate good from bad within a stream, or anomalies from artifacts. As technology develops, machines will get better at self-diagnosis. Until then, isolating the direct impact of variables on performance metrics is the best way to determine whether a system is optimized, and from there an understanding of “normal” can be developed.
Condensing Many Metrics Streams into Actionable Insight
As the skill gap problem demonstrates, the dimensionality of operational data is both a blessing and a curse. Countless insights are hidden within a mass of metrics data, but it takes a practiced hand to sift the value out of the deluge of information.
Because the deluge of data coming from a dynamic cloud infrastructure is so varied, alerting on anomalies is very difficult. An easily detectable sudden spike may be perfectly normal, while a gradual build is cause for concern and harder to pick out. The difference between artifacts and distress signals needs to be understood before such a system can yield practical value.
Operational Data Science requires an investment in training partially because it is very flexible, and every application will require a slightly different approach. That flexibility is what makes operational analytics so useful for DevOps—it informs every decision in system maintenance and upgrades. The visibility that ODS creates allows quick identification of how a product is working.
At Wavefront, we used real time metrics analytics to decide whether Amazon’s Elastic File System would suit our needs. It would have been costly, time consuming, and risky to do a full blown trial. Instead, one of our engineers signed up for a trial and wrote some test programs to gather data on the speed of the service. In under 20 minutes, the engineer found that EFS was optimized for a small number of large files, when we would have used it for a large number of small files. Without directly analyzing the product’s metrics, it could have taken days or weeks (and a much more expensive engineering effort) to determine that our specific needs would require a different service.
Metrics monitoring can also uncover problems in novel architectures. IoT is an emerging sector that takes the notion of “distributed system” to a physical extreme. ODS is the most direct way to make remote assessments of degradations in IoT networks.
For instance, Clover provides smart point-of-sale (POS) devices and associated payment services. After one upgrade, they noticed that a small percentage of their devices slowed down for a few hours in the morning. Looking at performance metrics, they saw that a light sensor was overloading the devices due to a software bug. It was an easy fix for them, but would have been hard to pinpoint without a complete picture of device and server activity. They were able to quickly rule out possible causes (such as server-side software/hardware issues) and shrink the potential context quickly. Clover was then able to correlate the local time with the device slow-down, and saw that the malfunction was triggered by the sunrise. Quick manipulation of metrics data across their entire stack told them the story of the bug from cause to consequence.
Faced with increasingly complicated software systems, effective DevOps needs to include a data-crunching capability. A big code deploy always comes with the potential for hidden bugs. A “fix” could have unintended consequences that propagate widely, and within distributed cloud systems, these problems can be harder to detect and isolate. Comprehensive metrics analytics is crucial for maintaining cloud infrastructure, making operational data science a must-have skill for DevOps.
About the author: Dev Nag is a founder and the CTO at Wavefront, a provider of real time analytic solutions. He was previously a Senior Software Engineer at Google where he helped develop the back-end for the financial processing of Google ad revenue. He also served as the Manager of Business Operations Strategy at PayPal where he defined requirements and helped select the financial vendors for tens of billions of dollars in annual transactions. Dev dropped out of the Stanford University School of Medicine to co-found Xiket, a venture-backed online health care portal for caretakers to manage the product and service needs of their dependents. Dev holds a B.S. in Mathematics and a B.A. in Psychology from Stanford University