
contint1001.wikimedia.org is almost unresponsive
Closed, Resolved (Public)

Description

On January 19th around 14:45, contint1001.wikimedia.org became very slow, with high CPU, memory, and disk I/O usage.

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=contint1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ci&from=now-3h&to=now

The Jenkins controller can't SSH to it:

Jan 19 14:56:31 contint2001 jenkins[998]: [01/19/22 14:56:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:17:32 contint2001 jenkins[998]: [01/19/22 15:17:32] SSH Launch of contint1001 on 208.80.154.17 completed in 1,206,232 ms
Jan 19 15:20:19 contint2001 jenkins[998]: [01/19/22 15:20:19] SSH Launch of contint1001 on 208.80.154.17 completed in 53,227 ms
Jan 19 15:46:31 contint2001 jenkins[998]: [01/19/22 15:46:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,029 ms
Jan 19 15:48:31 contint2001 jenkins[998]: [01/19/22 15:48:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:50:31 contint2001 jenkins[998]: [01/19/22 15:50:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:52:31 contint2001 jenkins[998]: [01/19/22 15:52:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms
Jan 19 15:54:31 contint2001 jenkins[998]: [01/19/22 15:54:31] SSH Launch of contint1001 on 208.80.154.17 failed in 65,005 ms
Jan 19 16:10:15 contint2001 jenkins[998]: [01/19/22 16:10:15] SSH Launch of contint1001 on 208.80.154.17 failed in 65,004 ms

According to htop, the host has 293 tasks and 2,158 threads.
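
For anyone triaging a similar overload, a quick way to see which processes account for the thread count is something like the following (illustrative commands, not taken from the incident), which lists processes sorted by their thread (nlwp) count:

ps -eo nlwp,pid,comm --sort=-nlwp | head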

Event Timeline

hashar triaged this task as Unbreak Now! priority. Jan 19 2022, 4:36 PM

There are a bunch of processes such as /usr/bin/node /opt/lib/node_modules/jest-worker/build/workers/processChild.js
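
For the record, counting and listing those workers would look something like this (illustrative, not the exact commands used):

pgrep -fc processChild.js
pgrep -fa processChild.js | head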

Mentioned in SAL (#wikimedia-operations) [2022-01-19T16:36:46Z] <hashar> marking contint1001.wikimedia.org as offline in Jenkins since it is dramatically overloaded T299542

I think that comes from Termbox; the first related event in Zuul would be:

2022-01-19 14:45:40,777 DEBUG zuul.DependentPipelineManager: Found job trigger-termbox-pipeline-rehearse for change <Change 0x7f01545b2cd0 753798,3>

I will ask around to get the host powercycled :-\

Hello ops-eqiad, contint1001.wikimedia.org is unresponsive. Moritz tried to reach it through the serial console but it did not work. Would you be able to remotely powercycle the host, please?

This was probably caused by https://integration.wikimedia.org/ci/job/termbox-pipeline-rehearse/92/console, a gate-and-submit build for a Termbox change (T296202) :(

The retry https://integration.wikimedia.org/ci/job/termbox-pipeline-rehearse/94/consoleFull ran on contint2001 and finished successfully, though only after the triggering build had already timed out (30-minute limit).

I’ll hold off on further gate-and-submit retries for this change.

@Cmjohnson acknowledged the issue and will be able to restart the host in a couple hours.

The services offered by contint1001 are zuul-merger and 12 Jenkins executor slots for PipelineLib. The same services are also offered by contint2001, so although contint1001 is unresponsive, the services remain available, albeit at half capacity.
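
For anyone wanting to check the remaining executor capacity, the standard Jenkins computer API exposes per-node state; a sketch of such a query (the endpoint is standard Jenkins, the tree fields shown are an assumption about what is most useful here):

curl -s 'https://integration.wikimedia.org/ci/computer/api/json?tree=computer[displayName,offline,numExecutors]'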

Mentioned in SAL (#wikimedia-operations) [2022-01-19T17:26:16Z] <_joe_> powercycling contint1001 via ipmi, T299542

No need for further restarts: I was able to powercycle the server using ipmi. @Cmjohnson you don't need to do anything :)
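
For future reference, a remote powercycle via ipmi looks roughly like this (the management hostname and user are assumptions based on the usual naming convention, not the exact command that was run):

ipmitool -I lanplus -H contint1001.mgmt.eqiad.wmnet -U root chassis power cycle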

Thank you very much for the powercycle.

I think one of the follow-up actions is T290608, which is about obsolete intermediate Docker layers and containers being kept on the machine. They pile up without garbage collection, which eventually makes Docker file operations slower than they should be due to the sheer number of files / directory entries that have to be crawled. With HDDs and lots of files, that does not play well.
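
Until that task is addressed, a periodic cleanup along these lines would keep the layer and container count down (the flags are standard Docker; the 72h retention window is just an assumption for illustration):

docker container prune --force --filter "until=72h"
docker image prune --all --force --filter "until=72h"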