MAY 13, 2017
The RED Method: key metrics for microservices architecture
In previous blog posts and talks I’ve alluded to The RED Method, our monitoring philosophy at Weaveworks. In this blog I’ll outline what The RED Method is, why we use it, and where it comes from.
What is The RED Method?
The RED Method defines the three key metrics you should measure for every microservice in your architecture. Those metrics are:
- Rate (the number of requests per second your service is serving)
- Errors (the number of those requests that are failing)
- Duration (the amount of time each request takes)
Measuring these metrics is pretty straightforward, especially when using tools like Prometheus (or Weave Cloud’s hosted Prometheus service). I’ve already written a blog post on how we instrument our services in Weave Cloud, so I’m not going to cover it here.
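To give a flavour of what this looks like, here is a minimal sketch (not the Weave Cloud instrumentation itself) of a Python service exposing RED metrics with the official prometheus_client library. The metric names, label names and route are illustrative assumptions, not something The RED Method prescribes:

```python
# A minimal sketch of exposing RED metrics from a Python service.
# Metric and label names here are illustrative assumptions.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests handled.",
    ["method", "route", "status_code"])
DURATION = Histogram(
    "http_request_duration_seconds", "Time spent handling each request.",
    ["method", "route"])

def handle(method, route):
    start = time.time()
    status = "200"
    try:
        pass  # ... real request handling would go here ...
    except Exception:
        status = "500"
        raise
    finally:
        # Rate and Errors come from the counter (sliced by status_code);
        # Duration comes from the histogram.
        REQUESTS.labels(method, route, status).inc()
        DURATION.labels(method, route).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        handle("GET", "/api/users")  # hypothetical route, for illustration
        time.sleep(1)
```

From those two metrics Prometheus can derive all three signals: rate(http_requests_total[1m]) gives the request rate, the same expression filtered on error status codes gives the error rate, and histogram_quantile over http_request_duration_seconds_bucket gives duration percentiles.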
Another nice aspect of The RED Method is that it helps you think about how to build your dashboards. Bring these three metrics front and center for each service, and express error rate as a proportion of request rate. At Weaveworks we settled on a pretty standard format for our dashboards: two columns, one row per service, with request & error rate on the left and latency on the right.
We’ve even built a Python library to help us generate these dashboards: GrafanaLib.
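As a rough sketch of the idea, a RED row per service with GrafanaLib might look something like the following. The service names, job labels and metric names are carried over from the instrumentation sketch above and are assumptions; the real Weave Cloud dashboards are richer than this:

```python
# A sketch of generating one RED dashboard row per service with grafanalib.
# Service names, job labels and metric names are illustrative assumptions.
from grafanalib.core import Dashboard, Graph, Row, Target

def red_row(service):
    return Row(panels=[
        Graph(
            title="%s: requests & errors" % service,
            dataSource="prometheus",
            targets=[
                Target(expr='sum(rate(http_requests_total{job="%s"}[1m]))' % service,
                       legendFormat="QPS", refId="A"),
                Target(expr='sum(rate(http_requests_total{job="%s",status_code=~"5.."}[1m]))'
                            ' / sum(rate(http_requests_total{job="%s"}[1m]))' % (service, service),
                       legendFormat="error ratio", refId="B"),
            ]),
        Graph(
            title="%s: latency" % service,
            dataSource="prometheus",
            targets=[
                Target(expr='histogram_quantile(0.99, sum(rate('
                            'http_request_duration_seconds_bucket{job="%s"}[1m])) by (le))' % service,
                       legendFormat="99th percentile", refId="A"),
            ]),
    ])

dashboard = Dashboard(
    title="RED metrics",
    rows=[red_row(s) for s in ("frontend", "users", "billing")],
).auto_panel_ids()
```

Because every service is described by the same three queries, adding a service to the dashboard is a one-line change.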
Why The RED Method?
Why should you measure the same metrics for every service? Surely each service is special? The benefit of treating each service the same, from a monitoring perspective, is scalability in your operations team - which, if you are like Weaveworks, means my fellow developers and me.
What does scalability of an operations team mean? I look at it in terms of how many services a given team can support. In an ideal world, the number of services the team can support would be independent of team size, and instead depend on other factors - what kind of response SLA you want, whether you need 24/7 coverage, and so on. So how do you decouple the number of services you can support from the team size? By making every service look, feel and taste the same. This reduces the amount of service-specific training the team needs, and reduces the service-specific special cases the on-call engineers need to memorize for those high-pressure incident response scenarios - what has been referred to as “cognitive load.”
As an aside, if you treat all your services the same, many repetitive tasks become automatable. Capacity planning? Do it as a function of QPS and latency. Dashboards, and alerts that link to playbook entries and those dashboards? Generate them automatically.
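To illustrate that kind of automation (a sketch of the idea, not the Weaveworks tooling), a per-service Prometheus alerting rule on error ratio could be stamped out like this. The threshold, metric names, job labels and playbook URL are all hypothetical:

```python
# A sketch of generating a Prometheus alerting rules file, one rule per service.
# Thresholds, metric names, job labels and the playbook URL are assumptions.
import yaml

SERVICES = ["frontend", "users", "billing"]

def error_rate_alert(service, threshold=0.05):
    expr = (
        'sum(rate(http_requests_total{job="%s",status_code=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total{job="%s"}[5m])) > %s'
        % (service, service, threshold))
    return {
        "alert": "%sHighErrorRate" % service.capitalize(),
        "expr": expr,
        "for": "5m",
        "labels": {"severity": "page"},
        "annotations": {
            "summary": "%s error ratio is above %.0f%%" % (service, threshold * 100),
            "playbook": "https://example.com/playbooks/%s" % service,  # hypothetical URL
        },
    }

rules = {"groups": [{
    "name": "red-alerts",
    "rules": [error_rate_alert(s) for s in SERVICES],
}]}

print(yaml.safe_dump(rules, sort_keys=False))
```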
Where does The RED Method come from?
I can’t take any credit for this philosophy, as it is 100% based on what I learned as a Google SRE. Google calls them the “Four Golden Signals”. The Google SRE book is a great read, and goes into way more depth than I can here.
Google includes an extra metric, Saturation, over and above the RED Method. I don’t include Saturation because, in my opinion, it is a more advanced use case. I think the first three metrics are really the most important, and people remember things in threes… But if you’ve mastered the first three, by all means include Saturation.
The name “The RED Method” started life as a tongue-in-cheek play on Brendan Gregg’s USE Method - another recommended read. The USE Method was being circulated around the office when we started Weave Cloud and were discussing how to monitor it. I felt like I needed a catchy name for the monitoring strategy I was proposing, so The RED Method was born. The USE Method is a fantastic way to think about how to monitor resources; we use it as a framework for monitoring the infrastructure behind Weave Cloud. However, I think the abstraction becomes a little strained when talking about services.
Limitations
It is fair to say this method only works for request-driven services - it breaks down for batch-oriented or streaming services, for instance. It is also not all-encompassing. There are times you will need to monitor other things - the USE Method is a great example when applied to resources like host CPU & memory, or caches.
Thank you for reading our blog. We build Weave Cloud, which is a hosted add-on to your clusters. It helps you iterate faster on microservices with continuous delivery, visualization & debugging, and Prometheus monitoring to improve observability.
Try it out, join our online user group for free talks & trainings, and come and hang out with us on Slack.
microservices, metrics, monitoring, cortex, weave-cortex, weave-cloud, troubleshooting
ABOUT TOM WILKIE

Tom is a Software Engineer at Weaveworks. Previously he was at Google as a Site Reliability Manager for Google Analytics. Before that he was Founder, VP Eng and CTO at Acunu, and before that a Software Engineer at XenSource. In his spare time, Tom likes to make craft beer and build home automation systems.