SRE/Observability/Dashboard guidelines

From Wikitech

Revision as of 12:39, 22 June 2021

Dashboard methods

Utilization Saturation Errors (USE)

This method is most effective for quickly diagnosing system performance issues. To quote Brendan Gregg's guide to USE:

 For every resource, check utilization, saturation, and errors.

The host overview dashboard shows an example of this method applied to inspect a single host's performance. Each resource (CPU, network, etc.) gets its own row: the left column shows the resource's utilization, while the right column shows saturation or errors, as applicable.
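The row and column arrangement described above can be sketched in code. The following is a minimal illustration, not the actual definition of the host overview dashboard: it assumes Grafana's 24-column dashboard grid and its gridPos fields (x, y, w, h), while the use_layout helper, the panel titles, and the panel height are hypothetical.

```python
from typing import Dict, List

GRID_WIDTH = 24   # Grafana dashboards use a 24-column grid
PANEL_HEIGHT = 8  # hypothetical panel height, in grid units


def use_layout(resources: List[str]) -> List[Dict]:
    """Build one row per resource: utilization left, saturation/errors right."""
    half = GRID_WIDTH // 2
    panels: List[Dict] = []
    for row, resource in enumerate(resources):
        y = row * PANEL_HEIGHT
        # Left column: how busy the resource is.
        panels.append({
            "title": f"{resource} utilization",
            "gridPos": {"x": 0, "y": y, "w": half, "h": PANEL_HEIGHT},
        })
        # Right column: queued work or error counts, whichever applies.
        panels.append({
            "title": f"{resource} saturation / errors",
            "gridPos": {"x": half, "y": y, "w": half, "h": PANEL_HEIGHT},
        })
    return panels


layout = use_layout(["CPU", "Network"])
for panel in layout:
    print(panel["title"], panel["gridPos"])
```

Keeping the same column meaning on every row lets a reader scan down the left side for busy resources and down the right side for overloaded or failing ones.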

Four golden signals (4GS)

This method is described in detail in Google's SRE book and focuses on the system's user-impacting metrics. Specifically, it can be used as a basis for alerting and for diagnosing ongoing problems.

Examples of this method in use include the swift and sessionstore dashboards, as well as any other service dashboard in the "Services" Grafana folder.

Data panel recommendations

  • Axes must be labeled
  • Y axis should be zero-based
  • Set fill to zero (no area fill under the lines), unless the graph is stacked
  • Ideally no more than four lines/metrics per panel
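The recommendations above could be checked mechanically. The sketch below is only an illustration under stated assumptions: the panel dictionary and its keys (x_label, y_label, y_min, fill, stacked, series) are a simplified, hypothetical description of a panel, not Grafana's actual panel JSON schema, and check_panel is not an existing tool.

```python
from typing import Dict, List

MAX_SERIES = 4  # "ideally no more than four lines/metrics per panel"


def check_panel(panel: Dict) -> List[str]:
    """Return the recommendations that a (hypothetical) panel description violates."""
    problems: List[str] = []
    if not panel.get("x_label"):
        problems.append("X axis is not labeled")
    if not panel.get("y_label"):
        problems.append("Y axis is not labeled")
    if panel.get("y_min") != 0:
        problems.append("Y axis is not zero-based")
    # Area fill is only expected on stacked graphs.
    if not panel.get("stacked", False) and panel.get("fill", 0) != 0:
        problems.append("fill should be zero on a non-stacked graph")
    if len(panel.get("series", [])) > MAX_SERIES:
        problems.append(f"more than {MAX_SERIES} lines/metrics in one panel")
    return problems


good = {"x_label": "time", "y_label": "requests/s", "y_min": 0,
        "fill": 0, "stacked": False, "series": ["2xx", "4xx", "5xx"]}
bad = {"y_label": "", "y_min": None, "fill": 1,
       "stacked": False, "series": ["a", "b", "c", "d", "e"]}

print(check_panel(good))
for problem in check_panel(bad):
    print(problem)
```

The series limit is treated here as a hard check for simplicity; since the guideline says "ideally", a real checker might report it as a warning instead.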