Network monitoring
Monitoring resources
Icinga | LDAP | Network monitoring#Icinga alerts
External monitoring | Open | See bug T199816
RIPE Atlas | Semi-open |
BGPmon | External | Network monitoring#BGPmon alerts
Cloudflare BGP leak detection | External | emails to noc@
RIPE RPKI | External | Network monitoring#RIPE Alerts
Icinga alerts
host (ipv6) down
  • If service impacting (eg. full switch stack down).
    1. Depool the site if possible
    2. Ping/page netops
  • If not service impacting (eg. loss of redundancy, management network)
    1. Decide if depooling the site is necessary
    2. Ping and open high priority task for netops
Router interface down
CRITICAL: host '', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>
Look at the alert on the Icinga portal; the full description is visible there.
The part that interests us is the one between the <BR> tags. In this example:
  • Interface name is xe-3/2/3
  • Description is Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
    • Type is Core, other types are for example: Peering, Transit, OOB.
    • The other side of the link is cr2-codfw:xe-5/0/1
    • The circuit is operated by Zayo, followed by the circuit ID
    • The remaining fields are optional (latency, cable #, speed)
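The breakdown above can be sketched as a small parser. This is only an illustration: the regular expressions are inferred from the single example alert on this page, and the names parse_alert, DESCRIPTION_RE and DOWN_RE are hypothetical, not part of any existing tooling.

```python
import re

# Field layout inferred from the single example alert on this page; treat it
# as an assumption, not a formal grammar for interface descriptions.
DESCRIPTION_RE = re.compile(
    r"^(?P<type>\w+):\s*"                   # link type, e.g. Core
    r"(?P<remote>[^\s(){}\[\]]+)?"          # other side, e.g. cr2-codfw:xe-5/0/1
    r"(?:\s*\((?P<provider>[^,)]+)"         # provider name
    r"(?:,\s*(?P<circuit_id>[^)]+))?\))?"   # circuit ID
    r"(?:\s*(?P<latency>\d+ms))?"           # optional latency
    r"(?:\s*\{#(?P<cable>\d+)\})?"          # optional cable number
    r"(?:\s*\[(?P<speed>[^\]]+)\])?\s*$"    # optional speed
)
DOWN_RE = re.compile(r"^(?P<interface>\S+): down -> (?P<description>.+)$")


def parse_alert(alert):
    """Return one dict per down interface listed between the <BR> tags."""
    links = []
    # The first chunk is the summary line (interface counters); skip it.
    for chunk in alert.split("<BR>")[1:]:
        down = DOWN_RE.match(chunk.strip())
        if down is None:
            continue
        fields = down.groupdict()
        described = DESCRIPTION_RE.match(fields["description"])
        if described is not None:
            fields.update(described.groupdict())
        links.append(fields)
    return links


SAMPLE_ALERT = (
    "CRITICAL: host '', interfaces up: 212, down: 1, dormant: 0, "
    "excluded: 0, unused: 0<BR>xe-3/2/3: down -> Core: cr2-codfw:xe-5/0/1 "
    "(Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>"
)
```

Calling parse_alert(SAMPLE_ALERT) yields a single entry whose provider field is set, which is what distinguishes a provider link from an internal one in the steps that follow.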
If such an alert shows up:
First, all links are redundant, but don't hesitate to depool the site if it's showing signs of a larger outage.
Identify the type of interface going down
  • 3rd party provider: Type can be Core/Transport/Transit/Peering/OOB, with an identifiable provider name that is present on that list
  • Internal link: Type is Core, no provider name listed
If 3rd party provider link
  1. Check whether the provider has a planned maintenance for that circuit ID on the maintenance calendar
  2. Check whether the provider sent a last-minute maintenance or outage email notification
If the maintenance is scheduled or the provider is aware of the incident
  1. Downtime the alert for the duration of the maintenance
  2. Monitor that no other links are going down (risk of total loss of redundancy)
If unplanned
  1. Open a phabricator task, tag netops, include the alert and timestamp
  2. Contact the provider using the information present in Netbox, make sure to include the circuit ID, and time when the outage started, cc noc@wiki. If able, include the output of show interfaces diagnostics optics <interface_name> from both sides.
  3. If needed, escalate to netops
  4. Monitor for recovery, if no reply to email within 30min, call them
  5. Close the task if quick recovery
If internal link
  1. Open a phabricator task, tag netops and dcops, include the alert and timestamp
  2. Most likely the optic needs to be replaced on one of the ends.
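As a rough illustration of the provider versus internal distinction used in this section, the sketch below classifies a link from its type and provider name and returns the Phabricator tags the steps call for. The function names are hypothetical, and the fallback to netops for an unrecognised link is an assumption, not something this page states.

```python
# Link types that this page lists as possible for 3rd party provider links.
PROVIDER_LINK_TYPES = {"Core", "Transport", "Transit", "Peering", "OOB"}


def classify_link(link_type, provider):
    """Return "provider", "internal" or "unknown" for a down interface."""
    if provider and link_type in PROVIDER_LINK_TYPES:
        return "provider"
    if link_type == "Core" and not provider:
        return "internal"
    return "unknown"


def task_tags(kind):
    """Tags for the task: netops for provider links, netops and dcops for internal ones."""
    tags = {"provider": ["netops"], "internal": ["netops", "dcops"]}
    # Assumption: anything unrecognised goes to netops for a closer look.
    return tags.get(kind, ["netops"])
```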
Juniper alarm
  • If warning/yellow: open a phabricator task, tag netops
  • If critical/red: open a phabricator task, tag netops, ping/page netops
You can get more information about the alarm by issuing the command show system alarms on the device.
BFD status
Follow Network monitoring#Router interface down
If the interface is not down, please check the following:
  • show bfd session will give you a summary of what link(s) are considered down by BFD.
  • show ospf neighbor / show ospf3 neighbor - is the peer up? If not, please check if there are OSPF alarms ongoing.
  • if OSPF looks good, it might be due to BFD being stuck in some weird state.
  • run clear bfd session address $ADDRESS (with $ADDRESS == IP address gathered in show bfd session)
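The last two steps can be sketched as follows. This assumes the show bfd session output is whitespace-separated with the peer address in the first column and the state in the second; verify that against the actual output before relying on it. The function names are hypothetical.

```python
import ipaddress


def down_bfd_sessions(output):
    """Return peer addresses whose BFD state is anything other than Up."""
    down = []
    for line in output.splitlines():
        cols = line.split()
        if len(cols) < 2:
            continue
        try:
            # Header and summary lines do not start with an IP address.
            ipaddress.ip_address(cols[0])
        except ValueError:
            continue
        if cols[1] != "Up":
            down.append(cols[0])
    return down


def clear_commands(addresses):
    """Build the clear command from this page for each address."""
    return [f"clear bfd session address {addr}" for addr in addresses]
```

Only clear a session after confirming that the interface and OSPF adjacency look healthy, as described above.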
OSPF status
Follow Network monitoring#Router interface down
BGP status
The AS# we consider critical (transit) are defined on that line.
If warning/yellow: open a phabricator task, tag/ping netops.
If the AS# is not private (outside of 64512 - 65534), it is an IX peer. Please do the following, to the best of your ability:
  • Run show bgp summary | match <AS#>. Anything other than "Establ" means the session is down (even Active is down)
  • Check on PeeringDB (look up <AS#>) whether the AS is still at the IXP; if not, remove it.
  • Check logs show log messages | match <AS#>
  • Try to ping the peer IP
  • Send an email to the peering (or NOC) address listed on their PeeringDB page (login required), CC peering@wiki, include the above troubleshooting, and ask whether there are any issues, whether they can investigate, and whether we should remove the peering.
If critical/red: treat it like a router interface down alert.
  1. Identify the peer name: in a terminal type `whois as#####` or lookup the AS number on
  2. Follow the router interface down instructions.
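The private AS# test and the session state check used above can be sketched like this. The 16-bit range is the one quoted on this page; the 32-bit private range from RFC 6996 is added here as an extra and is not mentioned on this page. The function names are hypothetical.

```python
PRIVATE_ASN_RANGES = (
    (64512, 65534),            # 16-bit private range, as quoted on this page
    (4200000000, 4294967294),  # 32-bit private range (RFC 6996), an addition
)


def is_private_asn(asn):
    """True if the AS number falls in a private-use range."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


def session_is_up(state):
    """Per this page, anything other than "Establ" counts as down."""
    return state == "Establ"
```

A peer whose AS# is not private is treated as an IX peer, and a state of Active still means the session is down.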
VCP status
  1. Open high priority DCops task and tag netops
  2. For DCops:
    1. run show virtual-chassis vc-port and identify the faulty port(s)
      eg. 1/2         Configured         -1    Down         40000
    2. re-seat the cable on both sides, if no success, replace optics or DAC
    3. If still down, escalate to Netops
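Identifying the faulty port(s) can be sketched as below. It assumes the column order of the example line above (port, type, trunk ID, status, speed) and that each member's section starts with a line such as fpc0:; both are assumptions to check against real output. The function name is hypothetical.

```python
def faulty_vc_ports(output):
    """Return (member, port) pairs whose status column reads Down."""
    faulty = []
    member = None
    for line in output.splitlines():
        cols = line.split()
        if len(cols) == 1 and cols[0].endswith(":"):
            # Section header for a virtual-chassis member, e.g. "fpc0:".
            member = cols[0][:-1]
            continue
        # Status is assumed to be the fourth column, as in the example line.
        if len(cols) >= 4 and cols[3] == "Down":
            faulty.append((member, cols[0]))
    return faulty
```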
Atlas alerts
See also: RIPE Atlas
PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) -!map
This one is a bit more complex, as it usually needs some digging to find out exactly where the issue is.
It means there is an issue somewhere between the RIPE Atlas constellation, the "in-between" transit providers, our providers, and our network.
As a rule of thumb, though:
First, monitor for drops in real HTTP traffic (e.g. on the Varnish dashboard) and check the NEL dashboard for signals of connectivity issues from real user traffic.
Be ready to de-pool the site (when possible) and page Netops if there are signs of a larger issue.
  • If a high number of probes fail (eg. >75%) or if both IPv4 and IPv6 are failing simultaneously, and there is no quick recovery (~5min), a false positive is less likely: ping Netops
  • If flapping with a number of failing probes close to the threshold, it's possibly a false positive: monitor/downtime and open a high priority Netops task
  • If it matches an (un)scheduled provider maintenance, it is possibly a side effect: if no quick recovery, page Netops to potentially drain that specific link
Lastly, sometimes this alert could be raised due to 500 errors from the RIPE Atlas servers, there is not much we can do in that case. (In this case you should see a slightly different error message from above, as there won't be a valid # of failed probes.)
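The rules of thumb above can be sketched as a small triage helper. The 75% figure comes from this page; the near_threshold margin of 1.25 used to decide what counts as "close to the threshold" is purely an assumption for illustration, as are the function names and return strings.

```python
import re

ATLAS_RE = re.compile(r"failed (\d+) probes of (\d+) \(alerts on (\d+)\)")


def parse_atlas_alert(message):
    """Extract probe counts, or None if the message carries none."""
    match = ATLAS_RE.search(message)
    if match is None:
        return None
    failed, total, threshold = (int(group) for group in match.groups())
    return {"failed": failed, "total": total, "threshold": threshold}


def triage(parsed, other_family_failing=False, near_threshold=1.25):
    """Rough classification following the rules of thumb on this page."""
    if parsed is None:
        # No valid probe counts, e.g. an error from the RIPE Atlas servers.
        return "no probe counts"
    ratio = parsed["failed"] / parsed["total"] if parsed["total"] else 0.0
    if ratio > 0.75 or other_family_failing:
        return "likely real"
    # Assumption: "close to the threshold" means within the given margin.
    if parsed["failed"] <= parsed["threshold"] * near_threshold:
        return "possible false positive"
    return "needs investigation"
```

The quick recovery window (~5min) and the provider maintenance check are not modelled here and still need a human look.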
To run the check manually, use the following from one of the icinga/alert servers:
/usr/lib/nagios/plugins/ $msg_id 50 35 -v (add -vv for debug info)
e.g.
$ /usr/lib/nagios/plugins/ 11645085 50 35 -v
UDM 11645085
Allowed Failures 35
Allowed % percent loss 50
Total 672
Failed (9): ['6641', '6482', '6457', '6650', '6397', '6209', '6722', '6409', '6718']
OK - failed 9 probes of 672 (alerts on 35) -!map
NEL alerts
See also: Network Error Logging
This one is a bit more complex, as it usually needs some digging to find out exactly where the issue is.
It means there is an issue somewhere between our userbase, the "in-between" transit providers, our providers, and our network.
Presently we only alert on a much higher-than-usual rate of tcp.timed_out and tcp.address_unreachable reports, which tend to indicate real connectivity issues. However, the problem may not always be actionable by us -- a large ISP having internal issues can trip this alert.
Things to check:
  • the NEL dashboard using the various breakdowns (geoIP country, AS number/ISP, Wikimedia server domain, etc) to attempt to perform a differential diagnosis of what the issue is
  • check for drops in received HTTP traffic (e.g. on the Varnish dashboard)
  • check for any corresponding RIPE Atlas alerts
If the pattern of reports implicates one edge site, be ready to depool it and see if this resolves the issue.
VRRP status
Open high priority Netops task.
BGPmon alerts
RPKI Validation Failed
  1. Verify that the alert isn't a false positive
  2. If the alert seems genuine, escalate to netops, as it might mean:
    • That prefix is being hijacked (voluntarily or not)
    • A misconfiguration on our side, which can result in sub-optimal routing for that prefix
RIPE alerts
Resource Certification (RPKI) alerts
See BGPmon/RPKI; you can use the other validation methods listed below.
Note that RIPE will only alert for the prefixes it is in charge of. See IP and AS allocations for the list.
LibreNMS alerts
List of current alerts listed on
Unless stated otherwise, open a tracking task for netops, then ack the alert (on the page above). Page if it's causing larger issues (or if you have any doubt).
If an alert is too noisy, you can mute it: edit the alert and flip the "mute" switch.
Primary outbound port utilization over 80%
The interface description will begin with the type of link saturating (or close to saturation).
  • Transit or peering: usually means someone (eg. T192688) is sending us lots of queries, the replies to which are saturating an outbound link
    1. Identify the source IP or prefix (webrequest logs, etc)
    2. Rate limit, block, or temporarily move traffic to another DC (eg. with DNS)
    3. Contact the offender
  • Core: usually a heavy cross DC transfer
    1. Identify who started the transfer (SAL, IRC), or which hosts are involved (manually dig down in LibreNMS's graphs)
    2. Ask them to stop or rate limit their transfer
Sensor over limit
Can mean a lot of things, often a faulty optic.
Juniper environment status
Often a faulty part in a Juniper device (eg. power supply).
Juniper alarm active
See Network monitoring#Juniper alarm
DC uplink low traffic
Means that a still active link saw its outbound traffic drop. Can mean that something is wrong with the device or routing.
Ensure that the site has proper connectivity. Depool the site if not.
Processor usage over 85% or Memory over 85%
  1. Gather data by issuing the command show system processes summary and show chassis routing-engine
  2. Watch the site for other signs of malfunctions
  3. If no quick recovery (~30min), escalate to netops
Storage /var over 50% or Storage over 90%
  1. Look for core dumps with show system core-dumps
    If there are any, they need to be escalated to JTAC
  2. Look for other large files in /var/tmp and /var/log
  3. If normal growth, cleanup storage with request system storage cleanup
Critical or emergency syslog messages
Escalate to netops, watch the site for other signs of failure.
Inbound/outbound interface errors
Usually means faulty optics/cable/port.
If you can connect to the network device, run show interfaces <interface> extensive | match error in order to have more information on the errors.
  • If a server, notify its owner and assign the task to DCops.
  • If a core/transit/peering/etc link, look for any provider maintenance notification (expected or not).
    • If none, assign the task to DCops, CC netops.
    • If any, wait for maintenance to end, watch for other signs of failures
Similar task: T203576
Traffic bill over quota
Because checks are attached to devices, a bill going over its threshold will alert for every device linked to that bill.
  1. Open a WMF-NDA Netops task (as it's about contracts)
  2. CC directors (as it's about billing)
  3. Ack the alerts in LibreNMS
  4. Use Netflow to figure out what traffic to steer away
  5. Use the AVOID-PATH feature of Homer
Poller is taking too long
Might indicate connectivity issue to the device's mgmt or an issue with its SNMP daemon.
BGP peer above prefix limit
Could either mean:
Peer has naturally grown past the current limit
  1. Identify faulty peer (IP and ASN in the message)
  2. Show the current limit for that peer with show configuration protocols bgp
  3. Get their recommended limit on PeeringDB
  4. Update configuration
If not, the peer had a misconfiguration and started to export prefixes that it shouldn't
  1. Wait a few hours
  2. Clear the bgp session with clear bgp neighbor <IP>
  3. If still triggering the limit, keep down, contact peer (info in PeeringDB)
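This page does not say how to tell natural growth from a misconfiguration. One possible heuristic, offered only as an assumption, is to compare the number of received prefixes with the limit the peer recommends on PeeringDB. The function name and return strings below are hypothetical.

```python
def prefix_limit_action(received, configured_limit, peeringdb_limit):
    """Suggest which of the two cases above applies (heuristic only)."""
    # If the peer recommends a higher limit than we have configured and is
    # still within it, the growth is plausibly natural: update our config.
    if configured_limit < peeringdb_limit and received <= peeringdb_limit:
        return "raise limit"
    # Otherwise the peer exceeds even its own recommended limit, or our
    # limit already matches it: treat as a possible misconfiguration.
    return "suspect misconfiguration"
```

For example, a peer sending 1200 prefixes against a configured limit of 1000 and a PeeringDB recommendation of 2000 would fall into the first case.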
Port with no description on access switch
Open DCops task to update description or disable port.
Port down
Open DCops task to investigate.
Traffic on tunnel link
Means that all links to a site are down and traffic is going through the last resort path.
  1. Escalate to netops
  2. Depool site
  3. Watch for provider maintenance notification
Duplicate IP on mgmt network
Open task for DCops to investigate/fix.
In the email there will be a line like:
arp info overwritten for from 4c:d9:8f:80:74:8c to 4c:d9:8f:80:23:9a
This means the IP is shared between the two MAC addresses.
It usually means someone typoed an IP recently.
Try to ssh to the IP and run "racadm getsysinfo" to get the service tag, then use Netbox to get the hostname. Compare that hostname to the one in DNS.
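Pulling the two MAC addresses out of that email line can be sketched as follows. The pattern is based only on the example line above, where the IP address is missing, so the IP is treated as optional; the function name is hypothetical.

```python
import re

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
ARP_RE = re.compile(
    rf"arp info overwritten for\s*(?P<ip>\S+)?\s*"
    rf"from (?P<old_mac>{MAC}) to (?P<new_mac>{MAC})",
    re.IGNORECASE,
)


def parse_arp_overwrite(line):
    """Return the IP (if present) and both MAC addresses, or None."""
    match = ARP_RE.search(line)
    return match.groupdict() if match else None
```

The two MAC addresses can then be looked up to find which hosts are claiming the same IP.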
Storm control in effect
More information on Storm_control
This means that something triggered a broadcast storm on the port being shut down, for example by looping a cable.
  1. Open a DCops task
  2. Identify and remove the source of the storm
  3. Clear the error with clear ethernet-switching port-error <port_name>
  4. Monitor for recovery
Some of the syslog messages seen across the infra and their fix or workaround are listed on
Last edited on 8 October 2021, at 17:58
Content is available under CC BY-SA 3.0 unless otherwise noted.