
Icinga Check SSL might have a time based race condition
Closed, Resolved (Public)

Description

This morning I saw the following:

03:39:46 <icinga-wm> PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
03:42:02 <icinga-wm> RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus

I am not sure how often we check SSL expiry, but this check apparently ran just before the Let's Encrypt certificate was renewed. The new certificate expires on Wed, 27 Jul 2022 20:27:52 GMT.

Looking at the alert history, though, the check seems to alternate between two different certificates:

OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
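
One way to confirm the alternation independently of Icinga is to open several fresh TLS connections and record the expiry of whichever certificate is served each time. The following is only a minimal sketch: the helper names (fetch_not_after, count_expiries, probe) and the default of 20 attempts are made up for illustration, and it assumes each new connection may be handled by a different Apache worker.

```python
import socket
import ssl
from collections import Counter


def fetch_not_after(host, port=443, timeout=5.0):
    """Open one TLS connection and return the served certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def count_expiries(samples):
    """Count how many times each expiry string was observed."""
    return Counter(samples)


def probe(host, attempts=20):
    """Hypothetical helper: sample the certificate over several fresh connections."""
    return count_expiries(fetch_not_after(host) for _ in range(attempts))


# Example (needs network access):
#   probe("gerrit.wikimedia.org")
# More than one key in the result means different certificates are being served.
```

If the result contains a single expiry, every sampled connection saw the same certificate; two or more keys would match the flapping seen in the alerts.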

A bit more:

Service Ok[2022-05-21 03:51:01] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;OK;SOFT;3;OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
Service Critical[2022-05-21 03:48:45] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;2;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Critical[2022-05-21 03:46:29] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;1;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Ok[2022-05-21 03:42:01] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;OK;HARD;3;OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
Service Critical[2022-05-21 03:39:45] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;HARD;3;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Critical[2022-05-21 03:37:29] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;2;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Critical[2022-05-21 03:35:13] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;1;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Ok[2022-05-21 03:32:59] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;OK;SOFT;2;OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
Service Critical[2022-05-21 03:30:43] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;1;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Ok[2022-05-21 03:21:45] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;OK;SOFT;2;OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
Service Critical[2022-05-21 03:19:31] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;1;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Ok[2022-05-21 03:10:29] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;OK;SOFT;2;OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
Service Critical[2022-05-21 03:08:13] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;1;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
Service Ok[2022-05-21 03:03:37] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;OK;SOFT;3;OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
Service Critical[2022-05-21 03:01:23] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;2;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).
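
The alert history above can be summarized mechanically. The sketch below is an illustration only (parse_alert and summarize are hypothetical helpers, and the regular expressions assume the exact line format pasted above): it extracts the check time, the state, and the certificate expiry from each line, then groups the states by expiry date.

```python
import re
from datetime import datetime

# Assumed formats, taken from the alert lines pasted in this task.
TIMESTAMP_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")
EXPIRY_RE = re.compile(r"(\d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2} [AP]M) GMT")

SAMPLE = [
    "Service Ok[2022-05-21 03:51:01] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;OK;SOFT;3;OK - Certificate 'gerrit.wikimedia.org' will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.",
    "Service Critical[2022-05-21 03:48:45] SERVICE ALERT: gerrit.wikimedia.org;Gerrit Health Check SSL Expiry;CRITICAL;SOFT;2;CRITICAL - Certificate 'gerrit.wikimedia.org' expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000).",
]


def parse_alert(line):
    """Return (check_time, state, cert_expiry) for one SERVICE ALERT line."""
    check_time = datetime.strptime(
        TIMESTAMP_RE.search(line).group(1), "%Y-%m-%d %H:%M:%S"
    )
    state = line.split(";")[2]  # third semicolon-separated field: OK / CRITICAL
    cert_expiry = datetime.strptime(
        EXPIRY_RE.search(line).group(1), "%d %b %Y %I:%M:%S %p"
    )
    return check_time, state, cert_expiry


def summarize(lines):
    """Map each distinct certificate expiry to the set of states reported with it."""
    seen = {}
    for line in lines:
        _, state, cert_expiry = parse_alert(line)
        seen.setdefault(cert_expiry, set()).add(state)
    return seen


for expiry, states in sorted(summarize(SAMPLE).items()):
    print(expiry.isoformat(), sorted(states))
```

On these lines, every CRITICAL result carries the 28 May expiry and every OK result carries the 27 Jul expiry, which is consistent with two different certificates being served rather than with a threshold that is simply too tight.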

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2022-05-21T14:11:56Z] <hashar> Icinga reports Gerrit Health Check SSL Expiry errors filed as T308908

Another flap happened last night.

@RhinosF1 has suggested restarting Apache, since it still has a worker running from last month. The restart hasn't happened yet, though.

dancy@gerrit1001:~$ ps -ef | grep apache2 | grep -v grep
www-data  1390 23733  0 Apr23 ?        07:39:55 /usr/sbin/apache2 -k start
www-data  6376 23733  1 00:00 ?        00:13:53 /usr/sbin/apache2 -k start
www-data  6377 23733  3 00:00 ?        00:34:57 /usr/sbin/apache2 -k start
www-data 19045 23733  2 06:13 ?        00:14:38 /usr/sbin/apache2 -k start
root     23733     1  0 Apr22 ?        00:02:11 /usr/sbin/apache2 -k start

I'm not restarting it now, in case someone wants to debug the problem while Apache is still in this odd state. If nothing happens today, I'll restart it myself.

From apache-status:

Current Time: Wednesday, 25-May-2022 15:16:44 UTC

Restart Time: Friday, 22-Apr-2022 19:59:53 UTC
+------------------------------------------------------------------------+
|    |     |        |Connections    |Threads  |Async connections         |
|Slot|PID  |Stopping|---------------+---------+--------------------------|
|    |     |        |total|accepting|busy|idle|writing|keep-alive|closing|
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|0   |6376 |no      |1    |yes      |0   |25  |0      |0         |1      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|1   |19045|no      |2    |yes      |1   |24  |0      |0         |1      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|2   |1390 |no (old |0    |yes      |0   |25  |0      |0         |0      |
|    |     |gen)    |     |         |    |    |       |          |       |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|3   |6377 |no      |3    |yes      |0   |25  |0      |2         |0      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|Sum |4    |0       |6    |         |1   |99  |0      |2         |2      |
+------------------------------------------------------------------------+

The slot with PID 1390 is marked "no (old gen)"; it is the worker started on April 23 in the ps output above.
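
Spotting such a leftover worker from the ps output can be sketched as follows. This is a rough heuristic rather than an established tool: stale_workers is a hypothetical helper, and it relies on `ps -ef` printing STIME as HH:MM for processes started today and as a date (for example Apr23) otherwise. It will also flag workers that were legitimately started on an earlier day.

```python
import re

# STIME is HH:MM for processes started today, otherwise a date such as "Apr23".
TIME_OF_DAY = re.compile(r"^\d{2}:\d{2}$")

PS_OUTPUT = """\
www-data  1390 23733  0 Apr23 ?        07:39:55 /usr/sbin/apache2 -k start
www-data  6376 23733  1 00:00 ?        00:13:53 /usr/sbin/apache2 -k start
www-data  6377 23733  3 00:00 ?        00:34:57 /usr/sbin/apache2 -k start
www-data 19045 23733  2 06:13 ?        00:14:38 /usr/sbin/apache2 -k start
root     23733     1  0 Apr22 ?        00:02:11 /usr/sbin/apache2 -k start
"""


def stale_workers(ps_output):
    """Return PIDs of apache2 child processes that were not started today."""
    rows = []
    for line in ps_output.strip().splitlines():
        # Columns of `ps -ef`: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) == 8 and "apache2" in fields[7]:
            rows.append(
                {"pid": int(fields[1]), "ppid": int(fields[2]), "stime": fields[4]}
            )
    apache_pids = {row["pid"] for row in rows}
    # A worker is a process whose parent is itself an apache2 process;
    # the root parent (PPID 1) is therefore never reported.
    return [
        row["pid"]
        for row in rows
        if row["ppid"] in apache_pids and not TIME_OF_DAY.match(row["stime"])
    ]


print(stale_workers(PS_OUTPUT))
```

For the listing above, this reports only PID 1390, the same process that apache-status marks as old gen.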

We are running Apache 2.4.38-3+deb10u7.

Looks like it is mostly a dupe of T293826.

@dancy let's indeed restart Apache entirely ;)

I restarted apache2 on gerrit1001.

+-----------------------------------------------------------------------+
|    |    |        |Connections    |Threads  |Async connections         |
|Slot|PID |Stopping|---------------+---------+--------------------------|
|    |    |        |total|accepting|busy|idle|writing|keep-alive|closing|
|----+----+--------+-----+---------+----+----+-------+----------+-------|
|0   |4840|no      |9    |yes      |1   |24  |0      |6         |2      |
|----+----+--------+-----+---------+----+----+-------+----------+-------|
|1   |4841|no      |10   |yes      |1   |24  |0      |7         |1      |
|----+----+--------+-----+---------+----+----+-------+----------+-------|
|Sum |2   |0       |19   |         |2   |48  |0      |13        |3      |
+-----------------------------------------------------------------------+
hashar assigned this task to dancy.

@dancy solved it by restarting Apache 2. The root cause lies somewhere in Apache 2 and is tracked in T293826.