April 11, 2022

AWS Charts a Multi-Pronged Path to IT Observability


There are scale problems in IT, and then there are AWS-scale problems. The size and complexity of the world’s largest public cloud provider dwarf pretty much everything else. So when it comes to collecting and analyzing logs, metrics, and events generated on its cloud, it’s not surprising that the company built its own solution. However, the story doesn’t end there, as the company is also making investments in open source.

Before joining Amazon Web Services almost a year ago as the vice president in charge of its monitoring and observability portfolio, Nandini Ramani spent years working at Twitter and Oracle, as well as startups. She knew how much data large operations could generate, especially telemetry data from Web and mobile applications, and how important that data could be to growth and customer satisfaction.

Just the same, Ramani had trouble comprehending the enormity of the observability figures she was hearing when she interviewed for the AWS job last year.

“We monitor 6 quadrillion metric observations per month. We ingest just over 3.5 exabytes of logs, and handle more than 32 trillion events,” Ramani tells Datanami. “These were stats that during my interview I heard, and I honestly could not even comprehend the number of zeros and the scale.”
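To put those monthly totals in perspective, they can be converted to rough per-second rates. The sketch below is back-of-the-envelope arithmetic only: it assumes a 30-day month and decimal exabytes (10^18 bytes), neither of which Ramani specified, and the names are illustrative.

```python
# Back-of-the-envelope conversion of the quoted monthly totals to per-second rates.
# Assumptions (not from the interview): a 30-day month, 1 exabyte = 10**18 bytes.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

monthly_totals = {
    "metric observations": 6 * 10**15,   # 6 quadrillion
    "log bytes": 35 * 10**17,            # 3.5 exabytes
    "events": 32 * 10**12,               # 32 trillion
}


def per_second(monthly_total: int) -> float:
    """Average rate per second, assuming the load is spread evenly over the month."""
    return monthly_total / SECONDS_PER_MONTH


rates = {name: per_second(total) for name, total in monthly_totals.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.2e} per second")
```

Under those assumptions, the figures work out to roughly 2.3 billion metric observations, about 1.35 terabytes of logs, and some 12 million events every second, on average.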

Of course, AWS didn’t start out that big. By 2014, AWS had between 2.8 million and 5.6 million servers, which is just a fraction of what it has today. But even 15 years ago, the company was on the edge of what existing server and network monitoring solutions could handle. So the Seattle, Washington-based company did what every right-thinking company would do when faced with a challenge for which there was no solution: It built its own.

The IT observability solution it built is today called Amazon CloudWatch. AWS rolled out access to CloudWatch to EC2 customers in 2009, and over the years, its usage has grown significantly. Today, there are more than 3 million active users of CloudWatch, which now supports about 70 of the 200-plus AWS services on offer.

Before joining AWS, Ramani was an external CloudWatch user, which gave her some insight into how the product is used and viewed by external customers. Now that she’s on the inside of the AWS firewall, she understands how important that internal use is to making CloudWatch what it is today.

“We integrate with all of our own services and we can pilot and beta test internally to make sure that it’s production ready,” she says. “It’s been an iterative process. We created it internally and now we’ve externalized it, and now we’re getting feedback from customers.”

CloudWatch provides observability into metrics, logs, and events generated by internal and external AWS users. It includes dashboards for viewers to see what’s going on, and can trigger an alarm when something goes awry in a customer’s AWS account. For serverless environments, there is CloudWatch Lambda Insights.
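The idea of an alarm that fires on a metric can be illustrated with a minimal sketch. This is not CloudWatch's implementation or API: the `ThresholdAlarm` class and its fields are hypothetical, and the rule shown (alarm when every one of the last N datapoints exceeds a threshold) is a simplified assumption. Only the three state names mirror those CloudWatch alarms report.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdAlarm:
    """Hypothetical threshold alarm; a simplified illustration, not the CloudWatch API."""

    threshold: float
    evaluation_periods: int  # how many recent datapoints must breach; assumed >= 1

    def state(self, datapoints: Sequence[float]) -> str:
        # Look only at the most recent `evaluation_periods` datapoints.
        recent = datapoints[-self.evaluation_periods:]
        if len(recent) < self.evaluation_periods:
            return "INSUFFICIENT_DATA"
        # Alarm only if every recent datapoint is above the threshold.
        if all(value > self.threshold for value in recent):
            return "ALARM"
        return "OK"


# Example: CPU utilization percentages, one datapoint per period.
cpu_alarm = ThresholdAlarm(threshold=80.0, evaluation_periods=3)
print(cpu_alarm.state([50.0, 85.0, 90.0, 95.0]))
print(cpu_alarm.state([85.0, 90.0, 70.0]))
```

Requiring several consecutive breaches, rather than one, is a common way to keep a brief spike from paging anyone.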

How CloudWatch works (Image source: AWS)

Several other products sit under the CloudWatch umbrella, including X-Ray, which delivers application tracing functionality to track down problems. Synthetic and real user monitoring and A/B testing frameworks have also been added to contribute to the observability cause.

In 2021, AWS broadened its observability reach when it launched managed services for Grafana and Prometheus, two popular open source observability products. It also launched the Amazon distribution for OpenTelemetry, a rapidly emerging standard for defining metrics, events, and (soon) logs.

While CloudWatch is AWS’s “primary featured offering,” the company will work with customers to consume observability data in their choice of platforms, Ramani says.

“I have never seen a company take customer obsession this vitally in everything we do every day. I think 97% of our roadmap is dictated and driven by customer needs,” she says. “We always put the customer first, and different customers have different needs, so we partner really well with ISVs…to make sure that no matter your destination of choice, whether it’s Amazon Managed Prometheus, whether it’s CloudWatch metrics or any ISV of your choice, we get you the data as quickly and easily as possible so that we get it to your destination rapidly.”

AWS’s observability solutions also aren’t limited to working in the AWS cloud. Many customers run hybrid setups with plenty of on-prem systems. For these customers, AWS will provide an agent that moves data from, say, your Red Hat OpenShift Kubernetes cluster up to the AWS cloud, where it can be consumed using the customers’ choice of products. Customers with Google Cloud and Microsoft Azure investments can also move data into the AWS observability solutions, Ramani says.

Currently, AWS is investing to bolster its solutions in the application performance management (APM) space, Ramani says. “We started out with synthetic user monitoring,” she says. “And then real user monitoring was also last year under the same umbrella of digital experience, which all fits under the broader APM space, which is also another area that we are investing heavily in.”

The company is going to be doing a lot in the APM space, Ramani adds. “We’re iterating on what are the additional needs and what more can we be doing for customers to solve end user performance monitoring,” she says.

Regardless of the product customers use, the data format will likely be the same, because AWS is standardizing on OpenTelemetry as the open standard, like many other players in the observability, AIOps, and APM space. AWS is incorporating OpenTelemetry into its observability offerings where it makes sense, and its developers are also contributing changes in the upstream project, which is managed by the Cloud Native Computing Foundation (CNCF).

“We are all converging on OpenTelemetry,” Ramani says. “We do believe that is the way forward and we want to be able to just rally around OpenTelemetry as the default for us.”

AWS released an OpenTelemetry distribution for tracing last year, Ramani says, and it’s gearing up to release another for metrics soon, with the still-evolving standard for logs to follow.

“We are basically in lockstep with what the community is doing around this,” she says. “I don’t think OpenTelemetry is there to replace everything today as it stands, because we need it to be across all logs, metrics, and traces. But it’s certainly picking up a lot of momentum and we are fully participating and every change we make is in upstream and we have a lot of contributors to it and that’s what we are also embracing internally.”

The observability space is hot at the moment, as evidenced by Grafana Labs’ $240 million funding round last week. Events, logs, and metrics are piling up at an unprecedented rate, making observability one of the most pressing big data needs. Considering the progress AWS has made bringing customers into its cloud and the massive scale challenges it has already solved, the company will be one to watch as the observability market enters the next phase of its growth.
