cp1089 memory errors on DIMM_B1
Closed, Resolved · Public

Description

cp1089.eqiad.wmnet reports the following error from racadm getsel:

-------------------------------------------------------------------------------
Record:      40
Date/Time:   06/10/2022 18:35:03
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
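For triage across several hosts, records in this layout can be scanned automatically. The following is a minimal sketch, assuming the log text has already been captured as a string and that every record uses the `Key: value` layout and dashed separators shown above. The function names `parse_sel` and `failing_dimms` are hypothetical, and the `DIMM_<letter><number>` slot pattern is an assumption based on this single record.

```python
import re

# Assumed layout: records are delimited by lines of dashes, fields are "Key:   value".
SEPARATOR = re.compile(r"^-{5,}\s*$")
FIELD = re.compile(r"^([A-Za-z/ ]+?):\s+(.*)$")
# Assumed slot naming, inferred from the single record in this task.
DIMM = re.compile(r"\bDIMM_[A-Z]\d+\b")


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1).strip()] = match.group(2).strip()
    if current:
        records.append(current)
    return records


def failing_dimms(records):
    """Return the sorted DIMM slots named in Critical records."""
    dimms = set()
    for record in records:
        if record.get("Severity") == "Critical":
            dimms.update(DIMM.findall(record.get("Description", "")))
    return sorted(dimms)


SAMPLE = """\
-------------------------------------------------------------------------------
Record:      40
Date/Time:   06/10/2022 18:35:03
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    print(failing_dimms(parse_sel(SAMPLE)))
```

The sketch only reads text; it does not invoke racadm itself, so how the log is collected is left to whatever tooling is already in place.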

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-06-10T18:40:04Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp1089.eqiad.wmnet with reason: downtimed because of DIMM replacement: T310387

Mentioned in SAL (#wikimedia-operations) [2022-06-10T18:40:10Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp1089.eqiad.wmnet with reason: downtimed because of DIMM replacement: T310387

Cmjohnson claimed this task.
Cmjohnson subscribed.

The server was out of warranty, so I swapped DIMM B1 with a DIMM from a spare. The server booted with no issues.

Thanks for your help @Cmjohnson!