Monitor prometheus exporters "up" status
Closed, Resolved (Public)

Description

We should monitor the availability of the Prometheus exporters we now have deployed, using the up metric. Prometheus generates this metric automatically for every scrape target and sets it to 0 whenever it is unable to scrape metrics from the given exporter (and to 1 when the scrape succeeds).
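To make the idea concrete, here is a minimal sketch of a per-job availability check over up samples. It is not the rule that was deployed for this task. The sample format, the job names, the instance addresses, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def job_availability(samples):
    """Return {job: fraction of targets whose `up` value is 1}.

    `samples` is a list of (labels, value) pairs, a hypothetical stand-in
    for the result of querying the `up` metric.
    """
    totals = defaultdict(int)
    healthy = defaultdict(int)
    for labels, value in samples:
        job = labels["job"]
        totals[job] += 1
        if value == 1:
            healthy[job] += 1
    return {job: healthy[job] / totals[job] for job in totals}


def low_availability_jobs(samples, threshold=0.9):
    """Return the sorted names of jobs whose availability is below `threshold`.

    The default threshold is an assumption, not a value taken from this task.
    """
    availability = job_availability(samples)
    return sorted(job for job, frac in availability.items() if frac < threshold)


# Hypothetical samples: two jobs, one of which has a target that failed to scrape.
samples = [
    ({"job": "node", "instance": "host1:9100"}, 1),
    ({"job": "node", "instance": "host2:9100"}, 1),
    ({"job": "varnish", "instance": "host1:9331"}, 0),
    ({"job": "varnish", "instance": "host2:9331"}, 1),
]

print(low_availability_jobs(samples))
```

Aggregating per job rather than alerting on each target keeps a single failed scrape from paging while still catching a job that is mostly down.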

In addition to the up metrics, exporters often export the status of the underlying daemon as <daemon>_up and we should monitor that too.
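The difference from the scrape-level up metric is that a <daemon>_up value of 0 arrives in a successful scrape: the exporter answers, but the daemon behind it does not. A rough sketch of spotting that case in an exporter's text output follows. The metric names (exampled_up, otherd_up) are hypothetical, and the parser assumes samples carry no trailing timestamp.

```python
def daemons_down(exposition_text):
    """Return the names of `*_up` metrics whose sample value is 0.

    Assumes the plain-text exposition format with one sample per line and
    no trailing timestamps (an assumption made to keep the sketch short).
    """
    down = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample value is the last space-separated field.
        name_and_labels, _, value = line.rpartition(" ")
        # Drop any {label="..."} part to get the bare metric name.
        name = name_and_labels.split("{", 1)[0]
        if name.endswith("_up") and float(value) == 0:
            down.append(name)
    return down


# Hypothetical exporter output; the metric names are made up for illustration.
SAMPLE = """\
# HELP exampled_up Whether the exampled daemon is reachable.
# TYPE exampled_up gauge
exampled_up 0
otherd_up{instance="a"} 1
exampled_requests_total 42
"""

print(daemons_down(SAMPLE))
```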

Dashboard linked to the alerts: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets

Event Timeline

Change 552521 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: alert on low job availability

https://gerrit.wikimedia.org/r/552521

Change 552521 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: alert on low job availability

https://gerrit.wikimedia.org/r/552521

Change 553335 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: alert on exporter's 'up' metrics

https://gerrit.wikimedia.org/r/553335

Change 553335 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: alert on exporter's 'up' metrics

https://gerrit.wikimedia.org/r/553335

fgiunchedi claimed this task.

All deployed now, boldly resolving.

Change 889887 had a related patch set uploaded (by BCornwall; author: BCornwall):
[operations/alerts@master] varnish: Runbook and dashboard for down exporter

https://gerrit.wikimedia.org/r/889887

Change 889887 merged by BCornwall:
[operations/alerts@master] varnish: Runbook and dashboard for down exporter

https://gerrit.wikimedia.org/r/889887