Icinga
Icinga ( http://www.icinga.org/ ) is host and service monitoring software using a binary daemon, some CGI scripts for the web interface and binary plugins to check various things. Basically, it is automated testing of our site that screams and sends up alarms when something fails. It originated as a fork of the earlier project "Nagios", from which WMF transitioned in 2013.
It can be set to monitor services such as ssh, squid status and the mysql socket, as well as the number of users logged in, load and disk usage. There are two levels of alarms (warning, critical) and the notification system is fully customizable (groups of users, notified by email / IRC / pager, stop notifying after x alarms, ...).
Our installation can be found at https://icinga.wikimedia.org which is currently an alias to machine icinga1001.
(April 2013) The rest of this page needs to be updated for Icinga
This page may be outdated or contain incorrect details. Please update it if you can.
Quick summary
Installation
On the server
Install the packages from source, found at http://www.nagios.org/download/core/thanks/ (both the core and plugins packages are needed)
After installing, do this:
cp /home/wikipedia/conf/nagios/* /etc/nagios/
service nagios start
and you're away.
On each client
Ubuntu:
apt-get update
apt-get -y install nagios-nrpe-server nagios-plugins
scp fenari:/home/wikipedia/conf/nagios/nrpe-debian.cfg /etc/nagios/nrpe.cfg
invoke-rc.d nagios-nrpe-server restart
Solaris:
pkgadd -d http://toolserver.org/~rriver/TSnrpe.pkg
pkgadd -d http://toolserver.org/~rriver/TSnagios-plugins.pkg
The right answers are: all, yes, all, yes, yes.
mv /lib/svc/method/nagios-nrpe /lib/svc/method/nagios-nrpe.old
sed 's/nrpe.cfg -dn/nrpe.cfg -d/' /lib/svc/method/nagios-nrpe.old > /lib/svc/method/nagios-nrpe
chmod a+x /lib/svc/method/nagios-nrpe
scp fenari:/home/wikipedia/conf/nagios/nrpe-solaris.cfg /etc/opt/ts/nrpe/nrpe.cfg
scp fenari:/home/wikipedia/conf/nagios/check-zfs /opt/local/bin/
svcadm -v enable nrpe
If you're installing on a server with no internet access, you can use a local path to the pkg file instead.
Wikimedia-fication
Customization
* Logo: yeah... had to put it somewhere to show our 'leetness, so it is in /usr/local/nagios/share/images on Spence
* Theme: I prefer a black theme. This is controlled in the CSS in /usr/local/nagios/share/stylesheets
* Links to other services: this is controlled by /usr/local/nagios/share/side.php
Addons
* Merlin
Merlin is an addon for Nagios that provides ease of integration and redundancy across multiple Nagios instances. Usually, we will want to have a Nagios installation in each datacenter, and each instance should be able to talk to the others, share data, and act as a backup should one fail. This is in essence what Merlin offers. The interesting thing about Merlin is that it stores everything in a MySQL DB, from host config to statuses. This is a lot easier to parse than Nagios' own files, which is why it was installed in the first place. However, at this moment nothing is making use of Merlin, and it is just there 'in case'. Find more information about Merlin at http://www.op5.org/community/projects/merlin.
* Ganglia / wikitech integration
I wrote a little perl script that parses ganglia data from Spence (/var/lib/ganglia) and tries its best to match up Ganglia hostnames with Nagios host definitions. In most cases it will work as advertised. The same goes for wikitech. Most servers don't have a wikitech entry associated with them, but some do. Most legacy systems and SPOFs should have an entry.
This script is located on the Nagios host in /etc/nagios/generate_ext-info.pl and runs automagically when sync is called.
Paging / Alert System Details
Our icinga installation sends pages through VictorOps.
Configuration
There are two ways to set up monitoring: using the old PHP script, and using Puppet.
PHP script
There's a configurator script for adding hosts, host groups, services and service groups at /home/wikipedia/conf/nagios/conf.php. Run it somewhere with PHP CLI installed, e.g. fenari. The configurator writes to a file called hosts.cfg in the current directory.
cd /home/wikipedia/conf/nagios
./sync
Most host groups (the ones in $hostGroups) are based on dsh node group files. If such a node group exists, using it is preferred for maintainability reasons; otherwise you can list miscellaneous hosts inline using $listedHosts. Some service groups (e.g. Apache and Squid) are just replicas of the host groups, while others (such as Lucene and Memcached) are taken from the MediaWiki configuration. Services may also be listed inline using $listedServices, but again, this is not preferred.
Other configuration should be done by editing the *.cfg files on NFS and then copying them to Spence. Keeping two up-to-date copies like this protects us against failure of the monitoring host or of NFS. (Note: the sync command actually replicates every .cfg file to Spence.)
If nagios refuses to restart due to a configuration error, you can get more information by running this on the monitoring host (Spence):
nagios -v /etc/nagios/nagios.cfg
The error messages can be cryptic at times.
Puppet
Puppet is being integrated with Nagios as well, in the file manifests/nagios.pp. To monitor the availability of a host, simply define the following anywhere under its node definition (e.g. in site.pp or an included class):
monitor_host { $hostname: }
To monitor a service, e.g. SSH, use something like the following:
monitor_service { "ssh": description => "SSH status", check_command => "check_ssh" }
Custom Checks
Custom checks can be found scattered throughout the operations/puppet git repo. Large concentrations of them can be found in these paths:
Many custom checks send HTTP requests to check the health of services. It's important that such checks follow meta::User-Agent policy to identify their traffic, and so that we don't inadvertently block monitoring requests when we need to shut off / ratelimit harmful bot traffic. The User-Agent sent by Icinga checks should be wmf-icinga/<script_name> (root@wikimedia.org).
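For example, a check implemented as a simple HTTP probe might send the header like this (a sketch only; the check name check_foo and the URL are hypothetical, the User-Agent format is the one stated above):

# probe a service endpoint, identifying the traffic per the User-Agent policy
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'User-Agent: wmf-icinga/check_foo (root@wikimedia.org)' \
  https://an-example-service.example.org/health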
Grafana
See Alerts (with notifications via Icinga).
To monitor a Grafana dashboard's alerts, use something like the following:
monitoring::grafana_alert { 'db/my-dashboard':
    contact_group => 'my-team',
}
Authentication
This information is outdated. Spence hasn't been a thing in years?
To add a user or update a password:
  1. Log in to Spence
  2. Run htpasswd /usr/local/nagios/etc/htpasswd.users <user>
IRC notification
Icinga appends messages to several different files (/var/log/icinga/irc*.log), and ircecho (which runs as a systemd service) maps lines appended there to channels.
If the bot gets wedged, a systemctl restart ircecho will usually fix it.
Updating channels where notifications are sent
  1. Submit a patch changing modules/profile/manifests/icinga/ircbot.pp
  2. Run puppet
  3. The bot will automatically restart on configuration change
Hostgroups
Hostgroups are configured in the operations/puppet repository, in hieradata/common/monitoring.yaml.
Acknowledgement logic
From Nagios Wiki (but this was just on Google Cache and the original site seemed gone, so pasted it here)
There is a difference between sticky and non-sticky acknowledgements
From Nagios 3.2.3. Assuming you have a service with notifications enabled for all states and a max retry attempts of 1, these are the notifications you should get based on the following transitions:
  1. service in OK
  2. service goes into WARNING - notification sent
  3. non-sticky acknowledgement applied
  4. service goes into CRITICAL. Acknowledgement removed. Notification sent
  5. non-sticky acknowledgement applied
  6. service goes into WARNING. Acknowledgement removed. Notification sent
  7. non-sticky acknowledgement applied
  8. service goes into CRITICAL. Acknowledgement removed. Notification sent
  9. service goes into OK. Recovery notification sent
This is the flow if sticky acknowledgements are used:
  1. service in OK
  2. service goes into WARNING - notification sent
  3. sticky acknowledgement applied
  4. service goes into CRITICAL. No notification sent
  5. service goes into WARNING. No notification sent
  6. service goes into CRITICAL. No notification sent
  7. service goes into OK. Recovery notification sent
Scheduling downtimes with a shell command
Modern approach (centralized):
From one of the cluster management hosts (cumin[12]001 as of August 2019) run the sre.hosts.downtime cookbook. See sudo cookbook sre.hosts.downtime -h for all the related info.
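For example (a sketch using the flags shown in the new-install examples further down this page; the reason, task ID and host are placeholders), to downtime a single host for 2 hours:

sudo cookbook sre.hosts.downtime -r 'planned maintenance' -t T123456 -H 2 example1001.eqiad.wmnet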
Modern approach:
This form of the command schedules a downtime of 2 hours
Old approach:
Put multiple hosts into a scheduled downtime, from now on for the next 3 days. Example used on Labs Nagios:
The Nagios command file is a named pipe at /var/lib/nagios/rw/nagios.cmd
for host in huggle-wa-w1 puppet-lucid turnkey-1 pad2 webserver-lcarr asher1 dumpster01 dumps-4 ; do
  printf "[%lu] SCHEDULE_HOST_DOWNTIME;$host;$(date +%s);1332479449;1;0;259200;Dzahn;down to save memory on virt3 having RAM issues\n" $(date +%s) > /var/lib/nagios/rw/nagios.cmd
done
After a few seconds you should see something like this in /var/log/icinga/icinga.log (on icinga1001)
[1332220596] HOST DOWNTIME ALERT: dumpster01;STARTED; Host has entered a period of scheduled downtime
Command Format: SCHEDULE_HOST_DOWNTIME;<host_name>;<start_time>;<end_time>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
quote: If the "fixed" argument is set to one (1), downtime will start and end at the times specified by the "start" and "end" arguments. Otherwise, downtime will begin between the "start" and "end" times and last for "duration" seconds. The "start" and "end" arguments are specified in time_t format (seconds since the UNIX epoch). The specified host downtime can be triggered by another downtime entry if the "trigger_id" is set to the ID of another scheduled downtime entry. Set the "trigger_id" argument to zero (0) if the downtime for the specified host should not be triggered by another downtime entry.
Removing downtimes with a shell command
All downtimes related to a host, including all its services, can be removed as follows. Note that the host variable is the name reported by hostname (not the FQDN returned with --fqdn). For example: cp4021.
echo -n "[$(date +'%s')] DEL_DOWNTIME_BY_HOST_NAME;$host" > /var/lib/icinga/rw/icinga.cmd
Disabling notifications programmatically
There are many scenarios in which we may want a role to run its Puppet logic to fully or partially provision its configuration, but not create alerts. Examples of this are:
For discussions about this topic, see task T151632
To disable notifications on a host, set profile::base::notifications in Hiera to disabled (it defaults to enabled); see the sketch below. This is intended as a temporary (even if extended in time) measure: if no check should exist when the server is in full production, just do not add it in the first place, or change its LEVEL.
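For example, in the Hiera data that applies to the host (hieradata/hosts/<hostname>.yaml is a common location, but the exact layer and path are an assumption here; any applicable Hiera lookup works):

# temporarily disable Icinga notifications for this host
profile::base::notifications: disabled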
Failover Icinga between the active and passive servers
[As of Jan. 2018] Icinga is currently installed in an active/passive configuration on icinga1001.wikimedia.org (eqiad, usually active) and icinga2001.wikimedia.org (codfw, usually passive). To check which one is active at any given time, look at what icinga.wikimedia.org points to: dig +short icinga.wikimedia.org.
To failover between the two servers, follow these steps:
Check the validity of the Icinga config
sudo /usr/sbin/icinga -v /etc/icinga/icinga.cfg
Meta-monitoring of Icinga itself
We're currently externally monitoring Icinga with a custom script. For the details see Wikitech-static#Meta-monitoring.
Restarting
To avoid alerts from external meta-monitoring, meta-monitoring should be disabled on Wikitech-static before restarting normally with systemctl. Details on how to disable meta-monitoring can be found here: Service_restarts#Icinga
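Once meta-monitoring is disabled, the restart itself is the usual systemctl invocation; the unit name icinga is an assumption here, so check Service_restarts#Icinga for the authoritative steps:

sudo systemctl restart icinga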
IRC bot
How to add some but not all notifications to a specific IRC channel.
The class used is profile::icinga::ircbot, which uses ::ircecho and is included in profile::icinga. Server, nickname and port are configured in Hiera in hieradata/role/common/alerting_host.yaml. The tcpircbot class is unrelated, though it is also included on the Icinga server.
  1. Create two custom notification commands (modules/nagios_common/templates/notification_commands.cfg.erb), notify-service-by-irc-<YOUR CHANNEL> and notify-host-by-irc-<YOUR CHANNEL>, i.e. one for services and one for hosts. Copy the command line from the existing "by-irc" commands, but make sure the output gets appended to a new log file, /irc-<YOUR CHANNEL>.log.
  2. Create a new Icinga contact (private repo, modules/secret/secrets/nagios/contacts.cfg), irc-<YOUR CHANNEL>. Copy an existing "irc-" contact but adjust host_notification_command and service_notification_command to use your new commands (and log file).
  3. Create a new Icinga contactgroup (public repo, modules/nagios_common/files/contactgroups.cfg) and add the special contact you created to it (and optionally the human members of your group, so they get notified too). Check that the Icinga config is OK after running puppet (icinga -v /etc/icinga/icinga.cfg).
  4. In Puppet, identify the monitoring::service / nrpe::monitor_service resources that should notify to this channel (or make new ones) and add the new contactgroup as a parameter to them.
  5. On the Icinga server, go to /var/log/icinga/ and check that the new log file has been created (you may have to touch it manually the very first time) and that alerts get logged to it.
  6. Configure ircecho (modules/profile/manifests/icinga/ircbot.pp) to map the log file to your IRC channel ($ircecho_logs, see existing examples and the sketch below), then restart ircecho if needed.
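As a rough sketch of that last step (the exact shape of the $ircecho_logs parameter in ircbot.pp is an assumption; copy an existing entry there rather than this verbatim), the configuration conceptually maps a log file to a channel:

# map the new Icinga IRC log file to the channel it should be echoed to
$ircecho_logs = {
  '/var/log/icinga/irc-<YOUR CHANNEL>.log' => '#<YOUR CHANNEL>',
}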
Avoid Icinga spam on new server installs
< volans> if you have more of those to do my suggestion is to temporarily disable puppet on icinga < volans> downtime for 1h, then merge the puppet patch, either wait 30 min or force a run on the new hosts (*not* all together though, -b 15 is a good one), then re-enable puppet on icinga and run the downtime with the force-puppet-run < volans> the double downtime is because the puppet apply might take down things that were already checked < volans> so you want to downtime the existing checks and then the new ones later on < volans> if you re-enable puppet late enough you don't need the downtime as all should already be green at the first attempt
The above translates to:
  1. disable puppet on icinga: ex: [icinga1001:~] $ sudo puppet agent --disable <reason/ticket ID>
  2. downtime for 1h: ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 mw13[63,74-83].eqiad.wmnet
  3. merge in Gerrit / puppet-merge on puppetmaster
  4. force puppet run: ex: dzahn@cumin1001:~$ sudo -i cumin -b 15 'mw13[63,74-83].eqiad.wmnet' 'run-puppet-agent -q'
  5. re-enable puppet on icinga: [icinga1001:~] $ sudo puppet agent --enable
  6. run downtime with force-puppet-run: ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 --force-puppet mw13[63,74-83].eqiad.wmnet
How to handle active alerts
Concepts
Icinga has a concept of handled vs. unhandled alerts. Handled alerts are those that are either in a scheduled downtime or have been acknowledged as "known".
Unhandled alerts are the ones that are not known yet and need attention.
When looking at the Icinga web UI the ones we should pay special attention to are the ones that are both "unhandled" and in status "CRITICAL".
In the red sections of the web UI's status totals you can see 3 numbers each; one section is about hosts and the other about services on hosts. The first of the 3 numbers is the number of unhandled CRITICALs.
Our goal is to keep the number of CRITICALs low and reasonable, balancing "good knowledge of unhandled ongoing issues" against "spam alerting" (which makes it more difficult to detect problems). Ongoing problems that are already being worked on should be turned into handled ("known") alerts.
You can also go directly to the URL below to see all unhandled alerts at once, besides CRITs this also includes WARNs and UNKNOWNS:
https://icinga.wikimedia.org/alerts
Merely disabling notifications does not move alerts out of the unhandled section and has the potential to be forgotten. Consider using downtimes and ACKs instead; ACKs are automatically removed on the next state change. IMPORTANT: Never ack or disable an alarm blindly (check with the owner or research it yourself).
Statements
Lines on /alerts mean "there is something bad happening" and nobody saw it yet, as they're both unACKed and unDOWNTIMEd.
A perfect /alerts page is an empty /alerts page.
All alerts can be early signs of a larger failure, potentially user facing down the road (if not already).
SREs monitoring that page are from all backgrounds and can't know about the real severity of every alert.
UNKNOWN are as bad as CRITICALs as they can be hiding a CRITICAL.
WARNINGs are a bit hard to sort out, but could become CRITICALs.
Hosts and services must be downtimed before any maintenance. (Either via the UI or via cookbook)
All SREs should monitor for alerts on /alerts or #wikimedia-operations.
Runbook
If you're thinking "this alert should open a task automatically" mention it on T225140
If you believe a team is not handling alerts for things they own, or a single service is creating too much alert spam, bring this up through an awareness channel (e.g. SRE meetings, a manager) to give visibility into why that may be happening and to get it fixed.