Mailman/Monitoring

From Wikitech

Logs are in /var/log/mailman3 and /var/log/mailman3/web, rotated daily with a max retention of 30 days. In the future they may end up in logstash too.

mailman3_runners

PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'

Mailman3 has a number of job runners that process the in/out/bounces/virgin/etc. queues. There's a known upstream bug: the runners don't restart automatically when they crash. This alert fires when one of them has crashed and needs a restart. Currently the fix is to run systemctl restart mailman3.
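To check by hand how many runners are alive, a minimal sketch along the following lines may help. It only matches on the runner path from the alert's regex and, unlike the real check, does not filter by UID. The helper names (iter_cmdlines, count_runners) are illustrative and not part of any existing tooling.

```python
import os

# Path taken from the alert's "regex args" above.
RUNNER_PATH = "/usr/lib/mailman3/bin/runner"


def iter_cmdlines(proc_root="/proc"):
    """Yield the command line of every process visible under proc_root (Linux only)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or is not readable
        # Arguments are NUL-separated in /proc/<pid>/cmdline.
        yield raw.replace(b"\0", b" ").decode(errors="replace").strip()


def count_runners(cmdlines, needle=RUNNER_PATH):
    """Count command lines that mention the runner script."""
    return sum(1 for line in cmdlines if needle in line)
```

Calling count_runners(iter_cmdlines()) on the Mailman host gives the current count, which can be compared against the number of runners seen on a healthy host.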

mailman3_queue_size

CRITICAL: 1 mailman3 queues above limits: bounces is 1053 (limit: 25)

Monitors the size of the in, virgin and bounces queues. You can also see the size on this dashboard. If this alert is firing, most likely one of the runners has crashed (in which case the mailman3_runners alert above has fired too) and Mailman needs a restart (systemctl restart mailman3).
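If the dashboard is unavailable, the queue sizes can be approximated by counting files in each queue directory. The following is a rough sketch, not the actual check: only the bounces limit (25) appears in the sample alert above, so limits for the other queues are left out and would need to come from the real check configuration. The function names are illustrative.

```python
import os

QUEUE_ROOT = "/var/lib/mailman3/queues"  # queue location described on this page

# Only the bounces limit is known from the sample alert; add the others as needed.
LIMITS = {"bounces": 25}


def queue_sizes(names, root=QUEUE_ROOT):
    """Return {queue name: number of regular files in that queue directory}."""
    sizes = {}
    for name in names:
        path = os.path.join(root, name)
        try:
            sizes[name] = sum(1 for entry in os.scandir(path) if entry.is_file())
        except FileNotFoundError:
            sizes[name] = 0  # treat a missing queue directory as empty
    return sizes


def over_limit(sizes, limits):
    """Return only the queues whose size exceeds a configured limit."""
    return {
        name: size
        for name, size in sizes.items()
        if name in limits and size > limits[name]
    }
```

For example, over_limit(queue_sizes(["in", "virgin", "bounces"]), LIMITS) returns the queues that are currently above their limit, and an empty dict when none are.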

You can look at the queue files in /var/lib/mailman3/queues/<name>. Each file in the queue is a pickle; you can dump it with mailman-wrapper qfile <filename> or just pickle.load(open('<filename>', 'rb')). The file must be opened in binary mode for pickle.load to work on Python 3.
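A slightly more careful way to dump a queue file from Python is sketched below. It assumes a queue file may hold more than one pickled object back to back (for example a message followed by its metadata), so it reads until end of file. Unpickling the message probably requires the mailman package to be importable, so run it with the same Python environment Mailman uses. Since unpickling can execute arbitrary code, only use it on files from the Mailman queue directories. The function name load_queue_entry is illustrative.

```python
import pickle


def load_queue_entry(path):
    """Return every pickled object stored in one queue file, in order."""
    objects = []
    with open(path, "rb") as fh:  # pickles are binary data
        while True:
            try:
                objects.append(pickle.load(fh))
            except EOFError:
                break  # no more pickles in this file
    return objects
```

Printing each item of load_queue_entry('<filename>') shows what the entry contains.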

The out queue size does not alert, because it commonly reaches hundreds of emails when someone emails a large list.

Misc stuff

We monitor a few key components:

  • HTTP
    • mailman archives: (hyperkitty) Are the archives for wikimedia-l reachable?
      • If not, check apache2.service and mailman3-web.service.
    • mailman list info: (postorius) Is the subscribe page for wikimedia-l reachable?
      • If not, check apache2.service and mailman3-web.service.
  • Queues
    • Mailman outbound queue hours until empty: How long will it take to drain the out queue?
      • If it's going to take a long time, check:
        • Date and time of day. Historically, the queue is longest on the first of the month, taking 15+ hours to drain, and at 08:00 UTC each day it takes 1+ hours.
        • dashboard
        • logs in /var/log/mailman3 and /var/log/exim4
        • you may have to look at the messages themselves in /var/lib/mailman3/queues/out to see what is queued up/not sending.
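This page does not describe how the "hours until empty" value is computed. As a rough sanity check, one plausible estimate divides the current out queue size by the net drain rate observed between two samples. The sketch below shows that estimate only, under that assumption; it is not the real check, and the function names are illustrative.

```python
def drain_rate(size_before, size_after, hours_elapsed):
    """Net messages leaving the queue per hour between two samples.

    New arrivals during the interval reduce this figure, so it is a net rate.
    """
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return (size_before - size_after) / hours_elapsed


def hours_until_empty(queue_size, sent_per_hour):
    """Estimate hours needed to drain queue_size messages at sent_per_hour."""
    if queue_size <= 0:
        return 0.0
    if sent_per_hour <= 0:
        return float("inf")  # queue is not shrinking
    return queue_size / sent_per_hour
```

For example, if the out queue went from 500 to 200 files over 1.5 hours, the net rate is 200 per hour and the remaining 200 messages would take about one hour. An infinite result means the queue is not draining at all, which points back at the runners or exim4.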