
contint1001.wikimedia.org is almost unresponsive
Closed, Resolved (Public)

Description

On January 19th around 14:45, contint1001.wikimedia.org became very slow, with high CPU, memory, and disk I/O usage.

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=contint1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ci&from=now-3h&to=now

The Jenkins controller can't SSH to it:

Jan 19 14:56:31 contint2001 jenkins[998]: [01/19/22 14:56:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:17:32 contint2001 jenkins[998]: [01/19/22 15:17:32] SSH Launch of contint1001 on 208.80.154.17 completed in 1,206,232 ms
Jan 19 15:20:19 contint2001 jenkins[998]: [01/19/22 15:20:19] SSH Launch of contint1001 on 208.80.154.17 completed in 53,227 ms
Jan 19 15:46:31 contint2001 jenkins[998]: [01/19/22 15:46:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,029 ms
Jan 19 15:48:31 contint2001 jenkins[998]: [01/19/22 15:48:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:50:31 contint2001 jenkins[998]: [01/19/22 15:50:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:52:31 contint2001 jenkins[998]: [01/19/22 15:52:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:54:31 contint2001 jenkins[998]: [01/19/22 15:54:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,005 ms
Jan 19 16:10:15 contint2001 jenkins[998]: [01/19/22 16:10:15] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms

According to htop, the host has 293 tasks and 2,158 threads.
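
For anyone triaging a similar overload, a quick way to see which processes account for the thread count is something like the following (illustrative commands, not taken from the incident), which lists processes sorted by their thread (nlwp) count:

ps -eo nlwp,pid,comm --sort=-nlwp | head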

Event Timeline

hashar triaged this task as Unbreak Now! priority. Jan 19 2022, 4:36 PM

There are a bunch of processes such as /usr/bin/node /opt/lib/node_modules/jest-worker/build/workers/processChild.js
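
For the record, counting and listing those workers would look something like this (illustrative, not the exact commands used):

pgrep -fc processChild.js
pgrep -fa processChild.js | head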

Mentioned in SAL (#wikimedia-operations) [2022-01-19T16:36:46Z] <hashar> marking contint1001.wikimedia.org as offline in Jenkins since it is dramatically overloaded T299542

I think that comes from Termbox; the first related event in Zuul would be:

2022-01-19 14:45:40,777 DEBUG zuul.DependentPipelineManager: Found job trigger-termbox-pipeline-rehearse for change <Change 0x7f01545b2cd0 753798,3>

I will ask around to get the host powercycled :-\

Hello ops-eqiad, contint1001.wikimedia.org is unresponsive. Moritz tried to reach it through the serial console but it did not work. Would you be able to remotely powercycle the host, please?

This was probably caused by https://integration.wikimedia.org/ci/job/termbox-pipeline-rehearse/92/console, a gate-and-submit build for a Termbox change (T296202) :(

The retry https://integration.wikimedia.org/ci/job/termbox-pipeline-rehearse/94/consoleFull ran on contint2001 and finished successfully, though only after the triggering build had already timed out (30-minute limit).

I’ll hold off on further gate-and-submit retries for this change.

@Cmjohnson acknowledged the issue and will be able to restart the host in a couple hours.

The services offered by contint1001 are zuul-merger and 12 Jenkins executor slots for PipelineLib. The same services are also offered by contint2001, so although contint1001 is unresponsive, the services remain available, albeit at half capacity.
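
For anyone wanting to check the remaining executor capacity, the standard Jenkins computer API exposes per-node state; a sketch of such a query (the endpoint is standard Jenkins, the tree fields shown are an assumption about what is most useful here):

curl -s 'https://integration.wikimedia.org/ci/computer/api/json?tree=computer[displayName,offline,numExecutors]'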

Mentioned in SAL (#wikimedia-operations) [2022-01-19T17:26:16Z] <_joe_> powercycling contint1001 via ipmi, T299542

No need for further restarts: I was able to powercycle the server using ipmi. @Cmjohnson you don't need to do anything :)
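
For future reference, a remote powercycle via ipmi looks roughly like this (the management hostname and user are assumptions based on the usual naming convention, not the exact command that was run):

ipmitool -I lanplus -H contint1001.mgmt.eqiad.wmnet -U root chassis power cycle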

Thank you very much for the powercycle.

I think one of the follow-up actions is T290608, which is about obsolete intermediate Docker layers and containers being kept on the machine. They pile up without garbage collection, which eventually makes Docker file operations slower than they should be due to the sheer number of files / directory entries that have to be crawled. With HDDs and lots of files, that does not play well.
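
Until that task is addressed, a periodic cleanup along these lines would keep the layer and container count down (the flags are standard Docker; the 72h retention window is just an assumption for illustration):

docker container prune --force --filter "until=72h"
docker image prune --all --force --filter "until=72h"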