
June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact
Closed, Resolved · Public · PRODUCTION ERROR

Description

Since approximately Friday, June 25, appservers seem to be accumulating 'stuck' active workers, which eventually causes a large tail-latency impact.

Number of appservers with zero idle php-fpm workers: https://w.wiki/3YvS

image.png (722×1 px, 94 KB)

Tail latency impact:
image.png (812×1 px, 139 KB)

@Dzahn and @elukey and others have been doing various rolling restarts as seen in SAL.

The cause is not clear; perhaps related to a certain kind of request, or perhaps a newly-introduced bug?

There's also a notable correlation with increased cache misses / regenerations on the WANObjectCache types SqlBlobStore_blob and also filerepo_file_foreign_description. It is unclear whether this is a symptom or a cause.

https://w.wiki/3YvU

image.png (691×1 px, 161 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 27 2021, 4:30 PM
CDanis triaged this task as High priority. · Jun 27 2021, 4:30 PM

Some notes from the restarts:

  • The appservers with a lot of busy workers eventually failed pybal health checks and were not able to respond to php7adm /apcu-frag (see T285593)
  • Before I started the last big batch of restarts (~30 nodes), I checked php7adm /apcu-frag on all of them; the fragmentation was in the range of 75%-85%
  • After a php-fpm restart, the appserver returned to a healthy state and did not fall back into the problematic state (causing latency problems, etc.). In other words, I had to restart php-fpm only once on each server.

My feeling is that, for some reason, the APCu fragmentation on appservers started to grow steadily, and the latency impact was a side effect of having one or more appservers running with more than 90% APCu fragmentation.

and also filerepo_file_foreign_description

This is something that @Ladsgroup and @tstarling have been working on, specifically around file metadata. I did notice https://gerrit.wikimedia.org/r/c/mediawiki/core/+/699907 in wmf.11 adding/modifying some recursion, but it looks correct to me, assuming the $metadata isn't super deep.

Just as a reminder, mw1384 was depooled during the second wave of appserver issues (over the weekend); the aim was to use it for further investigation. Some details about what I currently see:

elukey@mw1384:~$ php7adm opcache-info | jq '.'
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 540358888,
    "free_memory": 428286856,
    "wasted_memory": 105096080,
    "current_wasted_percentage": 9.78783518075943
  },
  "interned_strings_usage": {
    "buffer_size": 100663296,
    "used_memory": 38252016,
    "free_memory": 62411280,
    "number_of_strings": 843283
  },
  "opcache_statistics": {
    "num_cached_scripts": 12237,
    "num_cached_keys": 20577,
    "max_cached_keys": 32531,
    "hits": 34126771539,
    "start_time": 1624458727,
    "last_restart_time": 0,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 0,
    "misses": 19543,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.99994273414117
  }
}
elukey@mw1384:~$ php7adm apcu-frag
<!DOCTYPE html>
<html lang="en" dir="ltr">
<meta charset="utf-8">
<title>Wikimedia Error</title>
[..]
elukey@mw1384:~$ sudo journalctl -u php7.2-fpm | tail -n 2
Jun 29 06:02:59 mw1384 php7.2-fpm[36464]: @cee: {"channel":"exception","message":"[YNq3klJLXfqpAEce5jvd5gAAAAQ] PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/php-monitoring/lib.php:41","exception":{"message":"Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes)","file":"/var/www/php-monitoring/lib.php:41","trace":"#0 [internal function]: unknown()\n#1 /var/www/php-monitoring/lib.php(41): apcu_sma_info()\n#2 /var/www/php-monitoring/lib.php(348): apcu_frag()\n#3 /var/www/php-monitoring/index.php(41): show_apcu_frag()\n"},"caught_by":"/etc/php/php7-fatal-error.php via php-wmerrors","phpversion":"7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf1","servergroup":"appserver","type":"mediawiki","normalized_message":"Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes) in lib.php","reqId":"YNq3klJLXfqpAEce5jvd5gAAAAQ","url":"/apcu-frag","ip":"::1","http_method":"GET","server":"wikimedia.org"}
Jun 29 06:02:59 mw1384 php7.2-fpm[36464]: PHP Fatal error:  Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/php-monitoring/lib.php on line 41
elukey@mw1384:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          63879       12080       10930        6525       40868       44743
Swap:             0           0           0

The PHP slow log shows entries up to the depool time, and most of the stack traces end in apcu-related calls. I can't really find much more, so we should either debug the host further or just restart php-fpm and repool it.

Mentioned in SAL (#wikimedia-operations) [2021-06-29T08:21:38Z] <elukey> depool mw1355 (mw appserver) for debugging - T285634

The problem seems to be quite clearly caused by excessive apcu locking. Let's review the symptoms:

  1. The slow log shows that all long-running requests are stuck waiting for apcu
  2. The apcu fragmentation is clearly very high when this happens
  3. The apcu fragmentation check runs out of memory on these appservers, showing that there are far too many slots to compute

We should track down which patch caused this by looking at the timing as well.

Mentioned in SAL (#wikimedia-operations) [2021-06-29T08:25:44Z] <elukey> cumin 'A:mw-eqiad' '/usr/local/sbin/restart-php7.2-fpm' -b 2 -s 30 - T285634

Also: the memory gets exhausted by this operation:

$sma_info = apcu_sma_info();

This means the data structure is larger than the ~500 MB PHP memory limit on those servers, which again supports the idea that the problem is many small slots being created in apcu, with overall apcu usage going up significantly.
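
For context, here is a minimal sketch of how a fragmentation percentage can be derived from apcu_sma_info(). This is an approximation, not the actual /var/www/php-monitoring/lib.php, and the 5 MB "small block" threshold is an assumption for the example; it also shows why the call is so expensive here, since the returned structure holds one entry per free block in shared memory.

<?php
// Illustrative sketch only, not the actual /var/www/php-monitoring/lib.php.
// apcu_sma_info() returns one entry per free block in the shared memory
// segment, which is why building this structure alone can blow through the
// ~500 MB memory_limit on a badly fragmented server.
function apcu_fragmentation_percent(): float {
    $sma = apcu_sma_info();       // full block lists (the expensive call)
    $fragmentedBytes = 0;         // free bytes stuck in small blocks
    $freeBytes = 0;               // total free bytes
    foreach ( $sma['block_lists'] as $list ) {
        foreach ( $list as $block ) {
            $freeBytes += $block['size'];
            if ( $block['size'] < 5 * 1024 * 1024 ) {   // assumed threshold
                $fragmentedBytes += $block['size'];
            }
        }
    }
    return $freeBytes ? 100 * $fragmentedBytes / $freeBytes : 0.0;
}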

Joe raised the priority of this task from High to Unbreak Now!.Jun 29 2021, 8:35 AM
Joe added projects: Release, Train Deployments.
Joe changed the subtype of this task from "Task" to "Production Error".

The recurring problem seems to have started last Friday, June 25th, in the first hours of the day. This seems to point to the latest train release as the culprit:

19:10 dduvall@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.11

Adding relevant tags because this seems like a train blocker; also setting the priority to UBN!, as we have to do rolling restarts of appservers daily, and that's not OK.

and also filerepo_file_foreign_description

This is something that @Ladsgroup and @tstarling have been working on, specifically around file metadata.

I don't think I changed ForeignDBFile::getDescriptionText() or anything in that area. I might have changed some other WANObjectCache calls, but not that one.

In general, WANObjectCache is just meant to work. If calling getWithSetCallback() too much causes this issue, that's a bug in itself, never mind what is calling it. It should either be configured in a different way, or it should be smart enough to not break things.
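
For reference, the getWithSetCallback() pattern being discussed looks roughly like the sketch below, within MediaWiki code. The cache key, TTL, and callback body ($fileName, $file, getDescriptionText()) are illustrative placeholders, not the code actually involved here; only the call shape matters.

<?php
// Rough illustration of the WANObjectCache::getWithSetCallback() pattern
// under discussion. Key, TTL and callback body are invented for the example;
// the point is that every cache miss re-runs the callback and re-stores the
// value, so a surge of misses turns into a surge of regenerations and writes.
$cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$value = $cache->getWithSetCallback(
    $cache->makeKey( 'filerepo-file-foreign-description', $fileName ), // hypothetical key
    $cache::TTL_DAY,
    function ( $oldValue, &$ttl, array &$setOpts ) use ( $file ) {
        // Regeneration path: only runs on a miss or when the value is stale.
        return $file->getDescriptionText(); // $file supplied by the caller
    }
);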

Mentioned in SAL (#wikimedia-operations) [2021-06-29T08:47:21Z] <elukey> repool mw13[55,84] after debugging - T285634

I don't know if it helps, but the increasing number of busy workers correlates with an elevated number of apcu gets: https://w.wiki/3ZMn

image.png (932×1 px, 125 KB)

image.png (914×1 px, 117 KB)

@jijiki I think this correlation is another hint that what we are seeing is some form of deadlock in apcu under excessive pressure. Why that happens is still kind of a mystery at the moment.

If you expand the above graphs, you can clearly see that something changed during the night between the 24th and the 25th.

I have a suspicion that this Wikibase cache is related:

https://grafana.wikimedia.org/d/u5wAugyik/wikibase-statsdrecordingsimplecache?orgId=1&from=now-7d&to=now

image.png (505×1 px, 99 KB)

This shows a 5-100x increase in hit rate.

There are also some non-trivial increases in miss rate:

image.png (476×1 px, 81 KB)

I think what @Addshore just found is a good candidate for being the source of the issue.

I'll try to get some more info from apcu on the servers, although they've all been restarted recently to ease the pressure. I will take several snapshots of the apcu metadata during the day.

Sadly, php-fpm doesn't expose apcu data, just the metadata. We might be able to extract the data with some gdb wizardry, but I'd leave that as a last resort.
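
A sketch of the kind of metadata snapshot meant here, using stock APCu introspection: apcu_cache_info() exposes per-entry key names, sizes, hit counts and TTLs, but not the cached values themselves. The output path is an arbitrary choice for the example, not an existing script.

<?php
// Sketch of a periodic APCu metadata snapshot (not an existing script).
// apcu_cache_info() returns per-entry metadata only: key name, size, hits,
// TTL, access times -- the cached values themselves are not exposed.
$info = apcu_cache_info();
$snapshot = [];
foreach ( $info['cache_list'] as $entry ) {
    $snapshot[] = [
        'key'      => $entry['info'],
        'mem_size' => $entry['mem_size'],
        'num_hits' => $entry['num_hits'],
        'ttl'      => $entry['ttl'],
    ];
}
// Arbitrary output location for the example.
file_put_contents( '/tmp/apcu-meta-' . date( 'Ymd-His' ) . '.json', json_encode( $snapshot ) );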

Scavenging the production logs, we found that Special:EntityData requests for rdf documents were possibly the culprit.

This is the result of profiling http://www.wikidata.org/wiki/Special:EntityData/Q146190.rdf : https://performance.wikimedia.org/xhgui/run/view?id=60daeb09163043ed29f0b1db

We perform 340k apcu_store calls and 178k apcu_get calls just for that page.
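
Schematically, the fan-out can be pictured as below. This is not actual Wikibase code; the key scheme and the lookupTermFromSecondaryStorage() helper are invented for illustration. The point is that when an RDF document references many entities and every term of every referenced property goes through a per-key APCu-backed cache, a single page view produces one cache read per (entity, language) pair plus a store per cold key, the kind of volume seen in the XHGui profile above.

<?php
// Schematic only -- not actual Wikibase code. The key scheme and the
// lookupTermFromSecondaryStorage() helper are invented for the example.
function lookupTermFromSecondaryStorage( string $id, string $lang ): string {
    // Stand-in for the real term-store / database lookup.
    return "label of $id in $lang";
}

function emitPropertyStubTerms( string $propertyId, array $languages ): array {
    $terms = [];
    foreach ( $languages as $lang ) {
        $key = "wikibase:term:$propertyId:$lang";   // hypothetical key scheme
        $label = apcu_fetch( $key, $hit );          // one apcu_get per term
        if ( !$hit ) {
            $label = lookupTermFromSecondaryStorage( $propertyId, $lang );
            apcu_store( $key, $label, 60 );         // many tiny APCu slots
        }
        $terms[$lang] = $label;
    }
    return $terms;
}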

Scavenging the production logs, we found that Special:EntityData requests for rdf documents were possibly the culprit.

Did the code change, or is it just that someone is spidering Special:EntityData, hitting that code an awful lot?

Scavenging the production logs, we found that Special:EntityData requests for rdf documents were possibly the culprit.

Did the code change, or is it just that someone is spidering Special:EntityData, hitting that code an awful lot?

I'd say the former, given the perfect time correspondence between wmf.11 being deployed to group 1 and sudden increases in apcu activity.

Change 702096 had a related patch set uploaded (by Tarrow; author: Tarrow):

[mediawiki/extensions/Wikibase@master] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702096

Change 702018 had a related patch set uploaded (by Ladsgroup; author: Tarrow):

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.11] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702018

Change 702019 had a related patch set uploaded (by Ladsgroup; author: Tarrow):

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.12] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702019

Did the code change, or is it just that someone is spidering Special:EntityData, hitting that code an awful lot?

Yes, entities mentioned by other entities started being looked up from secondary data.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/699417 is where this started for all entities and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/698474/12/repo/includes/Rdf/RdfBuilder.php#421 is where it happened a little earlier for just properties.

However, with wmf.10 being skipped, it looks like both happened at the same time.

Change 702018 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.11] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702018

Mentioned in SAL (#wikimedia-operations) [2021-06-29T11:35:10Z] <ladsgroup@deploy1002> sync-file aborted: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]] (duration: 00m 10s)

Mentioned in SAL (#wikimedia-operations) [2021-06-29T11:36:47Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.11/extensions/Wikibase/repo/includes/Rdf/PropertyStubRdfBuilder.php: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]], Part I (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2021-06-29T11:38:03Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.11/extensions/Wikibase/repo/: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]], Part II (duration: 00m 58s)

This is basically done; we just need to wait and see whether it continues to happen.

Change 702096 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702096

Change 702019 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.12] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702019

Data on the number of apcu gets/s normalized after the release: https://grafana-rw.wikimedia.org/goto/4-r5Lhk7z

I'm going to optimistically resolve this bug.