
codfw: es2021: Correctable memory error rate exceeded for DIMM_A1
Closed, Resolved, Public

Description

The iDRAC log is showing some errors on DIMM_A1. I would like to coordinate taking this server down sometime next week to swap DIMM_A1 with DIMM_B1.

Thanks.

Event Timeline

Papaul triaged this task as Medium priority. Sep 3 2021, 2:56 PM

@Papaul let's do that next week, which day/time would work for you?

@Marostegui I will confirm the day and time next week.

Thanks.

I have agreed with @Papaul to do this after the switchover.

@Papaul let me know when you want this powered off and I will have it ready for you.

@Marostegui I will be back on site tomorrow. If you are available, I can ping you while I am onsite.

Thank you.

Change 723807 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es2021: Disable notifications

https://gerrit.wikimedia.org/r/723807

Change 723807 merged by Marostegui:

[operations/puppet@production] es2021: Disable notifications

https://gerrit.wikimedia.org/r/723807

Mentioned in SAL (#wikimedia-operations) [2021-09-27T11:43:25Z] <marostegui> Turn off es2021 for onsite maintenance T290327

@Papaul es2021 is now off and ready for you.

I have also upgraded the replicas to 10.4.21.

DIMM A1 swapped with DIMM B2. Leaving the task open for now to monitor whether we see the issue on DIMM B1.

Thanks.

Thanks @Papaul - mysql started. Let's give it a week and, if nothing arises, let's close it.

@Marostegui I checked the server today and all looks good. Resolving this task for now; if we see the problem again, we can reopen it.

Thanks.