Prometheus alerts repository

In this repository you will find the Prometheus-based alerts deployed to production, split by team.

By default, the alerts are deployed to all site-local Prometheus instances (e.g. ops, k8s).

For more information refer to Alertmanager's wikitech page: https://wikitech.wikimedia.org/wiki/Alertmanager

Testing

CI runs tox on this repository at code review time. You can also run the tests locally by calling tox (Python 3). You'll also need to have the following tools in your $PATH:

On Debian systems the promtool binary is part of the prometheus package, which will also start the Prometheus server. To stop the server and prevent it from starting at boot, run the following:

systemctl stop prometheus
systemctl mask prometheus

To also disable the timers for various node exporters, run:

systemctl list-timers 'prometheus*' | perl -ne 'print "$1\n" if /(prometheus-.+\.timer)/' | \
    xargs sudo systemctl disable
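
To see what the perl filter in this pipeline extracts before disabling anything, it can be run on its own. The sketch below wraps the same filter in a hypothetical helper, extract_timers, and feeds it two made-up lines shaped like list-timers output; the unit names are placeholders, not real units.

```shell
# Same perl filter as in the pipeline above, wrapped in a function reading stdin.
# extract_timers is a hypothetical helper name used only for this illustration.
extract_timers() {
    perl -ne 'print "$1\n" if /(prometheus-.+\.timer)/'
}

# Illustrative input only: two made-up lines shaped like `systemctl list-timers`
# output, where the timer unit is followed by the service it activates.
sample='Mon 2024-01-01 00:00:00 UTC 5min left n/a n/a prometheus-example-a.timer prometheus-example-a.service
Mon 2024-01-01 00:10:00 UTC 15min left n/a n/a prometheus-example-b.timer prometheus-example-b.service'

# Prints one timer unit name per input line that contains one.
printf '%s\n' "$sample" | extract_timers
```

Only the timer unit names reach xargs; the header, footer and service columns of the real output are dropped because they do not match the pattern.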

Finally, to stop pint and prevent it from running at startup, run the following:

systemctl stop pint
systemctl mask pint
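
Before running tox locally it can help to confirm that the required tools are on your $PATH. The following is a minimal sketch, not part of the repository: missing_tools is a hypothetical helper, and the tool list (tox, promtool, pint) is an assumption based only on the tools this README names, so the actual list may differ.

```shell
# Print each given tool that is not found in $PATH, one per line.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Assumed tool list: only the tools mentioned in this README.
if missing=$(missing_tools tox promtool pint); then
    echo "all required tools found"
else
    # $missing is left unquoted on purpose so the names are joined by spaces.
    echo "missing from PATH:" $missing
fi
```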

Testing with Docker

Tests can also be run locally in Docker, using the Blubber file for the CI image provided with this repository.

Build an image from .pipeline/blubber.yaml with:

DOCKER_BUILDKIT=1 docker build --target test -t alerts-tests -f .pipeline/blubber.yaml .

Run the test container with:

docker run --entrypoint tox alerts-tests

Deploying

The repository is self-service for members of the wmf LDAP group. In other words, a +2 will trigger the CI tests and, if they pass, merge the change. After the merge, the alerts are deployed at the next Puppet run (i.e. within 30 minutes).