SRE/Observability/Dashboard guidelines: Difference between revisions

From Wikitech
m Jobo moved page Observability/Dashboard guidelines to SRE/Observability/Dashboard guidelines: consistency with all SRE

Revision as of 13:41, 21 June 2021

Dashboard methods

Utilization Saturation Errors (USE)

This method is most effective for quickly diagnosing system performance issues. To quote Brendan Gregg's guide to USE:

 For every resource, check utilization, saturation, and errors.

The host overview dashboard shows an example of this method applied to inspect a single host's performance. Resources (CPU, network, etc.) are placed in rows: the left column shows the resource's utilization, while the right column shows saturation or errors, as applicable.
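The row and column arrangement described above can be sketched in a few lines of Python. This is only an illustration of the layout idea: the Panel class, the use_layout function and the resource names are hypothetical and do not correspond to the host overview dashboard's actual definition or to any Grafana API.

```python
from dataclasses import dataclass


@dataclass
class Panel:
    """Hypothetical panel placement, not a Grafana object."""
    title: str
    row: int     # 0-based dashboard row, one row per resource
    column: str  # "left" or "right"


def use_layout(resources):
    """One row per resource: utilization left, saturation/errors right."""
    panels = []
    for row, name in enumerate(resources):
        panels.append(Panel(f"{name} utilization", row, "left"))
        panels.append(Panel(f"{name} saturation / errors", row, "right"))
    return panels


# Example resource names are assumptions chosen for illustration.
layout = use_layout(["CPU", "Network"])
for panel in layout:
    print(panel.row, panel.column, panel.title)
```

Keeping the same column meaning in every row lets a reader scan down the left side for utilization and down the right side for saturation or errors without re-reading panel titles.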

Four golden signals (4GS)

This method is described in detail in Google's SRE book and focuses on the system's user-impacting metrics. Specifically, it can be used as a basis for alerting and for diagnosing ongoing problems.

Examples of this method include the swift and sessionstore dashboards, as well as any other service dashboard in the "Services" Grafana folder.
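The four signals named in the SRE book are latency, traffic, errors and saturation. As a rough sketch of what each one measures, the following Python derives them from a small in-memory sample. The Request record, the golden_signals function and the in_flight/capacity inputs are assumptions made for illustration; real dashboards would compute these from metrics queries rather than from raw request lists.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """Hypothetical request record used only for this example."""
    duration_s: float
    ok: bool


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize the four golden signals for one observation window."""
    count = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Latency: median request duration, in seconds.
        "latency_s": median(r.duration_s for r in requests) if count else 0.0,
        # Traffic: requests per second over the window.
        "traffic_rps": count / window_s,
        # Errors: fraction of requests that failed.
        "error_ratio": failed / count if count else 0.0,
        # Saturation: share of (assumed) capacity currently in use.
        "saturation": in_flight / capacity,
    }


sample = [Request(0.1, True), Request(0.2, True),
          Request(0.3, False), Request(0.4, True)]
signals = golden_signals(sample, window_s=2.0, in_flight=5, capacity=10)
print(signals)
```

The median is used here for brevity; a higher percentile is often a better latency signal, since it exposes slow outliers that a median hides.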

Data panel recommendations

  • Axes must be labeled
  • Y axis should be zero-based
  • Use fill zero, unless the graph is stacked
  • Ideally no more than four lines/metrics per panel
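The recommendations above could be checked mechanically. The sketch below assumes a simplified, hypothetical panel description: the keys x_label, y_label, y_min, stacked, fill_zero and series are invented for this example and are not Grafana's dashboard JSON schema.

```python
def check_panel(panel):
    """Return the recommendations that a (hypothetical) panel dict violates."""
    problems = []
    if not panel.get("x_label") or not panel.get("y_label"):
        problems.append("axes must be labeled")
    if panel.get("y_min") != 0:
        problems.append("Y axis should be zero-based")
    # Fill zero is only expected when the graph is not stacked.
    if not panel.get("stacked", False) and not panel.get("fill_zero", False):
        problems.append("use fill zero unless the graph is stacked")
    if len(panel.get("series", [])) > 4:
        problems.append("ideally no more than four lines/metrics per panel")
    return problems


good = {"x_label": "time", "y_label": "requests/s", "y_min": 0,
        "stacked": False, "fill_zero": True, "series": ["2xx", "4xx", "5xx"]}
bad = {"x_label": "time", "y_label": "", "y_min": 10,
       "stacked": True, "series": ["a", "b", "c", "d", "e"]}

print(check_panel(good))
print(check_panel(bad))
```

The last rule is worded as "ideally" in the list, so a real check would more likely report it as a warning than as a hard failure.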