Fix Icinga checks for test/decom servers
Closed, ResolvedPublic

Description

We have a bunch of test servers that have all the checks configured in Icinga and then a very long scheduled downtime, optionally with notifications disabled. Usually the same thing happens for to-be-decom servers.

I think that this is the wrong approach because over time Icinga base checks are added or renamed, and the original scheduled downtime for the host and all services ends up covering only the host and some services, defeating the purpose of having this host not alarm at any time.
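The drift described here can be illustrated with a small sketch. It only compares two lists of service names; the names used are hypothetical and nothing here queries a real Icinga instance.

```python
def uncovered_services(current_services, downtimed_services):
    """Return services that exist now but are not covered by the old downtime."""
    return sorted(set(current_services) - set(downtimed_services))


# Hypothetical service names: the downtime was scheduled when only two checks existed.
downtimed = ["SSH", "disk space"]
# A base check was added later, so it was never part of the downtime.
current = ["SSH", "disk space", "puppet last run"]

gaps = uncovered_services(current, downtimed)
print(gaps)
```

Any name in the result is a check that can still alarm even though the host was meant to be silent.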

We should instead puppetize this so that these kinds of hosts get a special configuration in Icinga: the check names carry some sort of identifier like TEST INSTANCE or DECOM to make them clearly recognizable on Icinga and IRC, and the checks are configured so that they never page.

Here are some random examples; see the full list on the Icinga downtimes page:

Event Timeline

We should instead puppetize this so that those kind of hosts have a special configuration in Icinga so that the check names have some sort of identifier like TEST INSTANCE or DECOM to clearly recognize them on Icinga and IRC

So I can imagine we would add code in the base puppet class that changes things based on a Hiera override, so we can just put something in Hiera to mark a host as a test/decom server, ok. (This makes me think of the existing role::spare, btw.) But then what exactly would we change? You say to add an identifier, but is that an additional benefit? Leaving a comment, ACKing, or scheduling downtime are already ways to identify that (unhandled vs. handled CRITs), so I can already recognize them. Do we still want the checks to exist? Do we want to change the check periods / intervals? Is it about disabling IRC notifications?

and be configured so that they don't ever page.

Hosts don't page anyway, just services that are explicitly marked as "critical => true". Usually it's just virtual hosts / service IPs that have services that actually page.

Change 327388 had a related patch set uploaded (by Dzahn):
hiera override to skip base icinga for test/decom hosts

https://gerrit.wikimedia.org/r/327388

^ Here's an approach to make it simpler and just skip the whole base::monitoring part if set in Hiera.

Change 327388 merged by Dzahn:
icinga/base: add hiera override to skip base monitoring

https://gerrit.wikimedia.org/r/327388

Change 336879 had a related patch set uploaded (by Dzahn):
cp1008: do not attempt to skip Icinga base monitoring

https://gerrit.wikimedia.org/r/336879

Change 336879 merged by Dzahn:
cp1008: do not attempt to skip Icinga base monitoring

https://gerrit.wikimedia.org/r/336879

After the merges above, base monitoring now gets skipped if you use "role::spare" on a node.

It's just that no servers currently use role::spare in site.pp anymore (after we decom'ed and shut down a few former varnish boxes that did).

Of the examples in the ticket description:

cp1008 - tried, but it's a special case because it also uses role cache::text, which requires a file from base monitoring. See https://gerrit.wikimedia.org/r/336879. Normally this would not be the case with test or decom hosts; it can maybe be fixed in that role.

db1019 - already gone

db1073 - looks like it is back in normal service

So that's that. I think we can close here. @Volans

This doesn't work to remove hosts from Icinga that are already in it. It only works for new hosts that have never been added. More changes to base::monitoring::host would be needed to make it actively remove checks as well.

So unlike what I said before: don't close yet. It's not really fully resolved, just partially.

Change 336956 had a related patch set uploaded (by Dzahn):
icinga/base: revert skipping base monitoring for role::spare hosts

https://gerrit.wikimedia.org/r/336956

Change 336956 merged by Dzahn:
icinga/base: revert skipping base monitoring for role::spare hosts

https://gerrit.wikimedia.org/r/336956

I have tried this twice using the simplistic approach with "profile::base::monitoring: false", but that does not actually work:

  • by role::spare::system: https://gerrit.wikimedia.org/r/#/c/336956/
  • by nodes named labtest*: https://gerrit.wikimedia.org/r/#/c/364444/

It needs fundamental changes to base::monitoring::host because, as Giuseppe said on Gerrit: "The problem is that this removes the monitoring::host definition, but it doesn't disable all the services from being defined. That breaks the icinga configuration."

It should work just fine, but only for NEW hosts; the issue is with existing hosts that already have services on them. If hosts get removed and services don't, Icinga breaks. Looking at it more now...
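To make the failure mode concrete, here is a minimal sketch of the consistency rule involved: every service must reference a host that is still defined. The data structures and the service names are assumptions for illustration; this is not how Icinga or our puppet code represents the configuration.

```python
def find_orphaned_services(hosts, services):
    """Return services whose host_name has no matching host definition."""
    defined = set(hosts)
    return [svc for svc in services if svc["host_name"] not in defined]


# Hypothetical state after removing only the host definition for cp1008:
hosts = ["cp1046"]
services = [
    {"host_name": "cp1046", "service_description": "SSH"},
    {"host_name": "cp1008", "service_description": "SSH"},  # host is gone
]

orphans = find_orphaned_services(hosts, services)
print(orphans)
```

A non-empty result corresponds to the situation Giuseppe described: the host definition is gone, the service definitions remain, and the configuration no longer validates.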

Change 368124 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base::monitoring: make it possible to disable monitoring

https://gerrit.wikimedia.org/r/368124

Alex said on Gerrit: "This patch makes it possible for a host to not be in our icinga installation configured. Which is not what T151632 originally asked for. I am not sure we want to have "ghost" hosts in our infrastructure (we've had that in the past) that are there but are not monitored." Hmm...

I think we should clarify a bit better what we want to do so that we all are on the same page. So here's a couple of questions to help with that.

  • Do we want test/decom servers to be in icinga ?
  • Do we want to run our usual battery of checks on them ?
  • Do we want to view the results of these checks in the web interface ?
  • Do we want to receive notifications for failures for the above checks ?

My answers are YES, YES, YES, NO. Feel free to answer these differently.

My reasoning is that I don't want us to have "ghost" hosts that are up and running and providing services in some form to something/someone without being monitored. At the same time I don't want to have page/IRC notifications for those. Finally in the web interface those hosts should somehow be fully ACKed/forever scheduled downtimed or something equivalent.

I think we can get almost all of the above by setting notifications_enabled to 0 in the icinga host definition and all its services. What happens with that is that we stop receiving the notifications, and in the web interface there's a small "mute" icon next to the host/service that has been muted. The one thing we unfortunately do not get is the web interface categorizing those hosts/services as ACKed/Handled. For bonus points, it seems possible to do that too: we could write a service handler that would only be enabled for those hosts/services and would effectively auto-ACK them.
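As a rough illustration of the notifications_enabled idea, the sketch below renders an Icinga 1.x style object definition with notifications switched off. The render_object helper and the SSH service are hypothetical, and the output is a fragment: a real definition needs further directives (or a template) that are left out here.

```python
def render_object(kind, attrs, muted=False):
    """Render an Icinga 1.x style 'define' block, optionally muted."""
    merged = dict(attrs)
    # notifications_enabled 0 stops notifications; the checks still run.
    merged["notifications_enabled"] = 0 if muted else 1
    lines = [f"define {kind} {{"]
    for key, value in merged.items():
        lines.append(f"    {key:<24}{value}")
    lines.append("}")
    return "\n".join(lines)


host_cfg = render_object("host", {"host_name": "cp1008"}, muted=True)
service_cfg = render_object(
    "service",
    {"host_name": "cp1008", "service_description": "SSH"},  # hypothetical service
    muted=True,
)
print(host_cfg)
print(service_cfg)
```

The point of the sketch is that the flag has to be set on the host and on every one of its services; muting only the host object would leave the services notifying.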

@Volans, @Dzahn, opinions?

My answers to the above questions are: YES, YES, YES (but I'd like them to be separated in the UI, unfortunately this is not possible in Icinga), NO

For decom servers I'm assuming they will already be in the spare role, so just a few checks will be run; correct me if I'm wrong.

An event handler that blindly ACKs any WARNING/CRITICAL/UNKNOWN in HARD state is pretty quick to write (just stealing pieces from the RAID one) and might do the trick. If something is flapping it will be called multiple times, but I don't really see that as a problem, to be honest. The number of hosts and checks involved should usually be small. It could also downtime those, but I'm not sure it is worth it at this point.
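A minimal sketch of what such a handler could look like, assuming it is invoked with the service state, state type, host name and service description as arguments. It only builds the ACKNOWLEDGE_SVC_PROBLEM external command string; the author name, the comment text and the function name are made up for illustration, and this is not the existing RAID handler.

```python
import time

ACK_STATES = {"WARNING", "CRITICAL", "UNKNOWN"}


def build_ack_command(state, state_type, host, service, now=None,
                      author="auto-ack", comment="test/decom host, auto-ACKed"):
    """Return an external command line that ACKs the problem, or None."""
    # Only act on problem states that have reached HARD state.
    if state not in ACK_STATES or state_type != "HARD":
        return None
    timestamp = int(time.time() if now is None else now)
    # Field order after the service: sticky;notify;persistent;author;comment.
    # The values 1;0;1 are illustrative choices, not a recommendation.
    return (f"[{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};"
            f"1;0;1;{author};{comment}")


cmd = build_ack_command("CRITICAL", "HARD", "cp1008", "SSH", now=0)
print(cmd)
```

A real handler would write the resulting line to the Icinga external command file, whose path depends on the installation. Because the function is stateless, being called repeatedly for a flapping service just produces the same command again.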

Optionally we could reduce the frequency of all checks for decom servers, but probably not for test ones.

Question: could a similar solution be applied to "in installation" hosts? Databases can take a day or more to provision, but they need the full role applied in order to be provisioned. However, right now when a new host gets puppet running, we have to rush to ACK its alerts, because they will start firing at some random point in the future (when puppet runs on the Icinga server after running on the host). Having all roles applied but an easy way to disable not-yet-applied checks and notifications would be ideal.

@jcrespo, assuming I have understood correctly what you want, yes I think so.

I think so too, but it might need some parameter or hiera value to define those hosts as "provisioning", given that they will already have the production MariaDB role but will not be fully provisioned. So if I'm understanding it correctly, yes, it's possible, but it will require an additional commit to remove the "provisioning" param/hiera value once the provisioning is completed.

Change 373291 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] WIP: Allow silencing notifications for hosts

https://gerrit.wikimedia.org/r/373291

What is the advantage of having these servers in monitoring if we also go to great lengths to make sure we don't see them (no notifications, ACKed)? Is anyone actually going to look at the web interface in the ACKed section and then have any reaction that isn't "oh these are just test servers and they are already ACKed"?

@Dzahn, that's a fair question. On my part I see the following. There's definitely the benefit of getting hardware monitoring, both at the host level (RAID checks, IPMI temperature checks) and at the BMC level (reachability, DNS correctness, SSH login functionality). There's also the capability to "half-provision" hosts (see @jcrespo's use case above) and finish the provisioning some time later. I do get the worry about things being auto-ACKed and eventually not being noticed. Perhaps we should skip that part for now, stick to manual ACKs, and evaluate later down the line whether we want auto-ACKs. Does that sound reasonable?

@Dzahn, does the above sound reasonable?

@akosiaris Sorry for the late reply, yes that definitely sounds reasonable. Let's go ahead as you suggested, incl. the part about evaluating auto-ACKs later, I guess.

Change 373291 merged by Alexandros Kosiaris:
[operations/puppet@production] Allow silencing notifications for hosts

https://gerrit.wikimedia.org/r/373291

This seems to work fine. cp1008 and all its services are marked as muted in icinga web, and the same goes for cp1046 (spare::system role). Jaime was kind enough to provide some documentation in https://wikitech.wikimedia.org/w/index.php?title=Icinga&type=revision&diff=1769985&oldid=1763907. I'll resolve the task for now; feel free to reopen if any reason comes up.

Change 368124 abandoned by Dzahn:
base::monitoring: make it possible to disable monitoring

https://gerrit.wikimedia.org/r/368124