mw2383 is misbehaving
Closed, Resolved (Public)

Description

mw2383 was presenting high load and high CPU usage when we switched over from eqiad to codfw (29 Jun 2021). At the time we depooled the server and restarted php-fpm. Today we noticed that it is exhibiting the same behaviour.

Looking at kernel messages, the following messages are logged at a higher frequency than on other mw* servers:

[Sat Jul 10 12:56:04 2021] traps: php-fpm7.2[15807] general protection ip:7f5cde938013 sp:7ffc2b8afa98 error:0 in libmemcached.so.11.0.0[7f5cde927000+30000]
[Sat Jul 10 14:26:42 2021] php-fpm7.2[14795]: segfault at 60000000e ip 00007f5cde93a2f9 sp 00007ffc2b8af880 error 4 in libmemcached.so.11.0.0[7f5cde927000+30000]

[Sun Jul 11 20:21:33 2021] Code: 00 31 db 4c 8d ac 24 e0 00 00 00 eb 07 0f 1f 40 00 83 c3 01 4c 89 ff e8 85 16 ff ff 39 d8 76 51 89 de 4c 89 ff e8 47 3f 00 00 <44> 8b 58 0c 49 89 c4 45 85 db 74 db 41 f6 47 01 10 0f 85 c8 02 00

[Mon Jul 12 07:17:38 2021] php-fpm7.2[12304]: segfault at 746e65746e00 ip 0000746e65746e00 sp 00007ffc2b8af288 error 14
[Mon Jul 12 07:17:38 2021] Code: Bad RIP value.
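To compare how often these faults occur across hosts, the kernel lines can be tallied per process and faulting module. The following is a minimal sketch, not a tool that was used on this task: the function name `count_faults` is hypothetical, and the regular expression only covers the two message formats quoted above.

```python
import re
from collections import Counter

# Covers the two fault formats quoted above:
#   traps: php-fpm7.2[15807] general protection ip:... in libmemcached.so.11.0.0[...]
#   php-fpm7.2[14795]: segfault at ... error 4 in libmemcached.so.11.0.0[...]
FAULT_RE = re.compile(
    r"(?P<proc>[\w.\-]+)\[(?P<pid>\d+)\]:?\s+"
    r"(?P<kind>general protection|segfault)"
    r"(?:.*?\sin\s(?P<module>[^\[\s]+)\[)?"
)


def count_faults(lines):
    """Count faults per (process, kind, module).

    module is None when the kernel line names no mapping for the
    faulting address (as in the 'Bad RIP value' case).
    """
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            counts[(m["proc"], m["kind"], m["module"])] += 1
    return counts


# Sample input: the kernel lines from the task description.
sample = [
    "[Sat Jul 10 12:56:04 2021] traps: php-fpm7.2[15807] general protection ip:7f5cde938013 sp:7ffc2b8afa98 error:0 in libmemcached.so.11.0.0[7f5cde927000+30000]",
    "[Sat Jul 10 14:26:42 2021] php-fpm7.2[14795]: segfault at 60000000e ip 00007f5cde93a2f9 sp 00007ffc2b8af880 error 4 in libmemcached.so.11.0.0[7f5cde927000+30000]",
    "[Mon Jul 12 07:17:38 2021] php-fpm7.2[12304]: segfault at 746e65746e00 ip 0000746e65746e00 sp 00007ffc2b8af288 error 14",
    "[Mon Jul 12 07:17:38 2021] Code: Bad RIP value.",
]

for key, n in count_faults(sample).items():
    print(n, key)
```

Running the same count over the kernel log of a healthy host such as mw2384 would give a baseline for "higher frequency".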

Looking at CPU frequency, it appears that something is triggering throttling, which prevents the CPU frequency from scaling up to handle the load.

mw2383

image.png (612×1 px, 76 KB)

mw2384

image.png (592×1 px, 78 KB)

Lastly, intel-microcode is up to date, and nothing interesting stood out when checking the management console log.

DC-Ops, can we check if the firmware is up to date? The server is depooled as it was increasing our overall latency.

Event Timeline

jijiki triaged this task as Medium priority. Jul 12 2021, 1:29 PM
jijiki updated the task description.
jijiki added subscribers: serviceops, DC-Ops.
RobH edited subscribers, added: RobH; removed: DC-Ops.

So I just happened to notice this, but in the future, please file requests using the form, as it outlines what has to happen. One of those things is assigning the proper site, which I've now corrected by adding ops-codfw.

Basically, my unrequested and unrequired monitoring of DC-Ops subscriptions is why I found this, but in the future it's best to use the form and file it for the site in question directly, thanks!

I'll pull down the firmware and try to flash it shortly.

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH edited projects, added ops-codfw; removed ops-eqiad.
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.

Effie,

I updated the firmware on this to the latest version. It hadn't been updated since its purchase and was a couple of revisions out of date. It is now ready to be placed back into service. I've not resolved this task since it's not back in service yet.

I depooled mw2383 as it was behaving as before. @RobH is it possible to run hardware tests on the host? I suspect it might be a hardware issue. Thank you!

Definitely, I'll go ahead and push this into the Dell ePSA testing suite to see if it finds the issue.

This is now running Dell's hardware test suite.

@jijiki can we depool mw2383 from scap if it's going to be down for an extended amount of time?

@Legoktm I will keep in mind to mark it as inactive, thanks.

<logmsgbot> !log ariel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2383.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-07-14T14:47:01Z] <effie> set mw2384 as inactive to investigate mw2383 issue - T286463

Summary of troubleshooting so far to see why this was throttling the CPU:

  • updated iDRAC and BIOS firmware to the latest revisions
    • this did not fix the issue when the host was returned to service after the update
  • compared BIOS settings to the working system mw2384: all settings are identical
  • ran Dell's ePSA test suite to see whether any hardware reports a failure: no errors

At this point, I'm not sure what else to try in order to reproduce the CPU throttling. Do the logs of the throttling denote which CPU is being throttled, or is it both of them?

If it is a single CPU, we can try swapping it out under support contract and see if the issue gets resolved.

I ACKed the Icinga alerts about the mismatching MediaWiki version and the host not being in DSH groups. Just make sure to 'scap pull' before repooling.

There are no logs that indicate throttling; I am going by what I see in the graphs. It appears that the CPU does not scale up higher than ~1GHz. Also, the message [Mon Jul 12 07:17:38 2021] Code: Bad RIP value. suggests that there might be a hardware error. I will try to find out if it is a single CPU being throttled.
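One way to check whether a single CPU is affected is to group the current core frequencies by socket using the standard Linux sysfs files `topology/physical_package_id` and `cpufreq/scaling_cur_freq`. This is only a sketch under the assumption that the host exposes a cpufreq driver; the function name `freq_by_package` is hypothetical and was not run on mw2383.

```python
from collections import defaultdict
from pathlib import Path


def freq_by_package(sysfs_cpu=Path("/sys/devices/system/cpu")):
    """Return {package_id: (min_mhz, max_mhz)} of current core frequencies."""
    khz = defaultdict(list)
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq/ or cpuidle/.
    for cpu_dir in sysfs_cpu.glob("cpu[0-9]*"):
        pkg_file = cpu_dir / "topology" / "physical_package_id"
        cur_file = cpu_dir / "cpufreq" / "scaling_cur_freq"
        if not (pkg_file.is_file() and cur_file.is_file()):
            continue  # offline core, or no cpufreq driver loaded
        # scaling_cur_freq is reported in kHz.
        khz[int(pkg_file.read_text())].append(int(cur_file.read_text()))
    return {
        pkg: (min(vals) / 1000, max(vals) / 1000)
        for pkg, vals in sorted(khz.items())
    }
```

If both packages stay near 1000 MHz under load, the limit is more likely system-wide (firmware or power policy) than a fault in one CPU.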

@jijiki can you try to get the system back in service? The system has 2 CPUs, so it will be difficult to tell which CPU is bad if we do not have any logs telling us which CPU is having issues. I would like to have the system back online so I can take a look as well while the system is having issues. I think Rob did all the firmware upgrades, which is great. If we do not find anything I will open a case with Dell.

Thanks.

Papaul, Effie is on vacation for 2 more weeks. Just FYI, so don't expect any action here soon.

Mentioned in SAL (#wikimedia-operations) [2021-08-03T20:45:54Z] <rzl@cumin1001> START - Cookbook sre.hosts.downtime for 13 days, 0:00:00 on mw2383.codfw.wmnet with reason: T286463

Papaul lowered the priority of this task from Medium to Low. Aug 3 2021, 8:46 PM

Mentioned in SAL (#wikimedia-operations) [2021-08-03T20:46:02Z] <rzl@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 13 days, 0:00:00 on mw2383.codfw.wmnet with reason: T286463

Mentioned in SAL (#wikimedia-operations) [2021-08-17T08:21:45Z] <mutante> mw2383 - scap pull (still depooled because T286463 but alerts in Icinga since a while)

@Papaul I can put the server back in production, but when the server was active, we didn't get any kind of logs as to which CPU is problematic, if it is not both. I will put it back in prod tomorrow morning (my time) and get back to you. Thank you!

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mw2383.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108180950_jiji_30353.log.

Completed auto-reimage of hosts:

['mw2383.codfw.wmnet']

and were ALL successful.

@Papaul I reimaged the server and pooled it back. It was performing horribly, which is why I didn't keep it pooled for more than half an hour. The behaviour was the same: it seems unable to scale its frequency up, as opposed to mw2384, which is a similar server with similar traffic.

Apart from the errors I have already posted here, I don't have anything else. I reckon the same messages would appear if I kept the server pooled, but its performance was so poor that I had to depool it.

mw2383

image.png (297×912 px, 29 KB)

mw2384

image.png (305×902 px, 38 KB)

@jijiki thank you, I will open a case with Dell.

Looking at the log today, I see:

2021-08-18 10:05:19 	PWR2400 	Power management firmware unable to maintain power limit.	
	
Log Sequence Number:
6709
Detailed Description:
The power management firmware cannot reduce power consumption to meet configuration or user defined settings.
Recommended Action:
1) Review system power policy. 2) Check system event logs for thermal or power related exceptions.

@jijiki Please see below for the issue we found. The power cap was enabled, with the cap limit set to 128 watts (below the recommended range of 213-355 watts), which was less than what both CPUs needed. I disabled the power cap limit, as on mw2384. We can try to put the server back in production to see if this resolves the issue.
thanks

Active Power Cap Policy    iDRAC: 128 Watts; 437 BTU/hr
Power Cap                  Enable
Power Cap Limits*          128 Watts (Recommended Range: 213 - 355 watts)
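The arithmetic behind this finding can be sketched as a small check that parses a "Power Cap Limits" line and reports how far the cap sits below the recommended minimum. The function name `cap_deficit_watts` is hypothetical, and the parser assumes the line format pasted above; it is not an iDRAC API.

```python
import re

# Matches e.g. "128Watts (Recommended Range : 213 - 355 watts)".
CAP_RE = re.compile(
    r"(?P<cap>\d+)\s*Watts\s*\(Recommended Range\s*:\s*"
    r"(?P<low>\d+)\s*-\s*(?P<high>\d+)\s*watts\)",
    re.IGNORECASE,
)


def cap_deficit_watts(limits_line):
    """Return how many watts the cap is below the recommended minimum.

    Returns 0 when the cap is at or above the minimum.
    """
    m = CAP_RE.search(limits_line)
    if m is None:
        raise ValueError(f"unrecognised power cap line: {limits_line!r}")
    cap, low = int(m["cap"]), int(m["low"])
    return max(0, low - cap)


line = "Power Cap Limits*   128Watts   (Recommended Range : 213 - 355 watts)"
print(cap_deficit_watts(line))  # 213 - 128 = 85
```

A non-zero result on a host whose power cap is enabled matches the PWR2400 event above: the firmware cannot hold consumption under a limit that is lower than what the hardware needs, so the CPUs stay clocked down.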

@jijiki it looks like mw2383 is happy now. Can we close this task?

Thanks

jijiki closed this task as Resolved. Aug 19 2021, 2:01 PM

@Papaul, server is pooled and works as it should, thank you very much for finding this!