
Test running php7.2 and php7.4 in parallel on the beta cluster
Closed, Resolved · Public

Description

We need to test that php 7.2 and php 7.4 can coexist safely in deployment-prep. To this end we need to:

  • Install an appserver with both
  • Install a jobrunner with both
  • Install both on the deployment server. This is a good opportunity to test the stages of scap that need php.
  • Install on the maintenance server, test some periodic scripts
  • Install on a dumps test server, run tests. T295580

For configuration of changeprop, there is https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep but the steps described there don't seem to generate a valid configuration currently.

And we should test basic functionality on all of those.

Related Objects

Status | Subtype | Assigned
Resolved | | None
Resolved | | Jdforrester-WMF
Resolved | | Jdforrester-WMF
Resolved | | Jdforrester-WMF
Resolved | | Jdforrester-WMF
Resolved | | toan
Resolved | | Lucas_Werkmeister_WMDE
Resolved | | Joe
Resolved | | Jdforrester-WMF
Resolved | | Ladsgroup
Invalid | | None
Resolved | | Reedy
Open | | None
Resolved | | tstarling
Resolved | | Jdforrester-WMF
Resolved | PRODUCTION ERROR | Legoktm
Resolved | | tstarling
Resolved | | Joe
Resolved | | Joe
Resolved | | tstarling
Resolved | | ArielGlenn

Event Timeline

I have installed a fresh appserver with both php 7.2 and 7.4, and:

  • Requests get redirected correctly based on the php_engine cookie
  • Monitoring reports data about both engines
  • Responses from 7.4 don't obviously miss details
  • php7.2 is chosen by default, and is the fallback in any case.
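The cookie-based selection described above can be sketched as a tiny decision function. This is illustrative only: the pool names and the dictionary interface are assumptions for the sketch, while the real routing lives in the Apache configuration.

```python
# Sketch of cookie-based engine selection (assumed pool names, not the
# actual Apache config): pick a PHP-FPM backend from the PHP_ENGINE
# cookie, defaulting to 7.2 and falling back to 7.2 on invalid values.
BACKENDS = {"7.2": "php7.2-fpm", "7.4": "php7.4-fpm"}
DEFAULT = "7.2"

def pick_backend(cookies):
    """Return the FPM pool for a request, given its cookies as a dict."""
    engine = cookies.get("PHP_ENGINE", DEFAULT)
    # Unknown or garbled cookie values fall back to the default engine.
    return BACKENDS.get(engine, BACKENDS[DEFAULT])
```

For example, `pick_backend({"PHP_ENGINE": "7.4"})` selects the 7.4 pool, while both a missing cookie and an invalid value such as `"8.0"` land on the 7.2 pool, matching the "default and fallback in any case" behaviour above.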
Joe triaged this task as Medium priority. Jan 5 2022, 9:06 AM

Change 755536 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] deployment-prep: install php 7.4 everywhere

https://gerrit.wikimedia.org/r/755536

Change 755536 merged by JMeybohm:

[operations/puppet@production] deployment-prep: install php 7.4 everywhere

https://gerrit.wikimedia.org/r/755536

Re-assigning to Janis as he's been doing work in this direction over the last week.

Puppet is failing on deployment-parsoid12:

Mar 03 07:07:58 deployment-parsoid12 php[31713]: PHP Warning:  PHP Startup: Unable to load dynamic library 'wddx.so' (tried: /usr/lib/php/20190902/wddx.so (/usr/lib/php/20190902/wddx.so: cannot open shared object file: No such file or directory), /usr/lib/php/20190902/wddx.so.so (/usr/lib/php/20190902/wddx.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0

Thanks @Majavah !

wddx is no longer bundled with PHP >=7.4 and T295725 suggests we could just stop loading it.

Change 769745 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Stop loading wddx PHP extension with PHP 7.4

https://gerrit.wikimedia.org/r/769745

Change 769745 merged by JMeybohm:

[operations/puppet@production] Stop loading wddx PHP extension with PHP 7.4

https://gerrit.wikimedia.org/r/769745

Manual tests (with and without the cookie PHP_ENGINE=7.4, as well as with invalid values) against the appserver and parsoid seem fine to me so far.

Change 787526 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: use helm3 semantics for generating beta config

https://gerrit.wikimedia.org/r/787526

There were some issues with the make_beta_config.py script under helm3; there's a CR open to fix them. In general, is the plan to create an entirely independent second instance of changeprop, or just to update its configuration?

Change 787526 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: use helm3 semantics for generating beta config

https://gerrit.wikimedia.org/r/787526

I spent a couple of hours familiarising myself with how cpjobqueue is configured and how it works in the beta cluster. I still don't understand why cpjobqueue needs to be part of this project. The task description says "install a jobrunner with both", but that is already done.

A reasonable deployment sequence for job execution might be:

  1. Run one job with PHP 7.4
    • This can be done by sending a fake job with curl, no need for cpjobqueue.
  2. Run some subset of real jobs with PHP 7.4
    • The implementation depends on what subset you want. Several options can be implemented in the Apache config without touching cpjobqueue.
  3. Run all jobs with PHP 7.4
    • This can be done in the Apache config.

I think the only option for step 2 that would need cpjobqueue involvement is if cookies are to be forwarded from a developer's browser to MediaWiki, then to cpjobqueue to be stored in Kafka, then back to MediaWiki again, similar to how request ID forwarding works. This is a complex option requiring changes to 3 or 4 repos.

I think my preferred option for step 2 is a random sample implemented in Apache.

> I think my preferred option for step 2 is a random sample implemented in Apache.

OK, that turned out to be more involved than expected. Apache doesn't have an easy way to get random numbers, and mod_proxy_balancer doesn't work for FastCGI backends. The complexity of the remaining solutions looks similar to doing it in cpjobqueue.

If cpjobqueue just provided a random integer in a header, Apache could take it from there. Or I could do cookie forwarding through cpjobqueue. Having cpjobqueue decide what version to send means we have to reconfigure it to turn the experiment on and off, which I want to avoid since the difficulty of reconfiguring this service is evidently so great that it derailed the whole project for months.
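If cpjobqueue did supply such a random integer in a (hypothetical) header, the Apache-side decision would amount to a simple threshold test. A minimal sketch, with the header name and per-mille granularity as assumptions:

```python
def select_version(random_header, sample_permille, pilot="7.4", default="7.2"):
    """Decide which PHP version serves a job, given a random integer in
    the range 0-999 supplied upstream in a hypothetical header value.
    sample_permille is the share of traffic (per thousand) sent to the
    pilot version; missing or garbled headers stay on the default."""
    try:
        r = int(random_header)
    except (TypeError, ValueError):
        return default
    return pilot if 0 <= r < sample_permille else default
```

Turning the experiment on and off, or ramping it up, then only means changing `sample_permille` on the Apache side, with no cpjobqueue reconfiguration.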

> Having cpjobqueue decide what version to send means we have to reconfigure it to turn the experiment on and off, which I want to avoid since the difficulty of reconfiguring this service is evidently so great that it derailed the whole project for months.

cpjobqueue has been reconfigured in deployment-prep multiple times recently; it's documented and takes a matter of minutes to do.

I don't think we need per-jobtype control here. From my perspective, the important points here are:

  1. we can switch PHP versions by cookie for appservers in production, and we might as well test this out in beta.
  2. we can control the PHP version used for a given jobrunner in production in an easy way, e.g. via a Hiera boolean variable, to let SRE ramp it up progressively; and again we might as well use the same mechanism to flip the one jobrunner in Beta to php7.4.
  3. we install both PHP versions on all jobrunners, so that it's quick to switch and to keep them as similar to appservers as possible.

If we approach this minimally with just that, then we don't need to insert any cookies or headers from cpjobqueue. I don't oppose having cpjobqueue insert a cookie header on all RunJob requests if that's a trivial config change, but it seems like it would require extra work: cpjobqueue targets the loadbalancer, not individual hosts, and presumably doesn't have a built-in way to randomly set a header for a % of traffic. So, on the assumption that it isn't a trivial config change, my suggestion would be to approach it only from the Apache side.

In hiera there is profile::mediawiki::php::php_versions, which is an array of PHP versions like [ "7.2", "7.4" ]. The default is the first element in the array. So we can test job execution on 7.4 by swapping the order of the versions for a specific server.

Let's assume there will be no cpjobqueue patch, and production pilot deployment will be done by switching to PHP 7.4 on a particular jobrunner using hiera.

I tested swapping the versions in horizon hiera. This caused puppet to break PHP-FPM by swapping the port numbers -- php7.4-fpm failed to start because port 8000 was still in use by php7.2-fpm.

A workaround for this is to just increment the port number each time the version set is changed.

profile::mediawiki::php::php_versions:
- '7.4'
- '7.2'
profile::php_fpm::fcgi_port: 8002

I tested this on deployment-jobrunner04 and it worked fine. It avoids the need to dive into the puppet classes and change the way we assign ports.
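A minimal model of the assumed port-assignment scheme (base port plus position in the version array; the actual Puppet logic may differ) shows why swapping the list collides with the still-running daemons, and why bumping the base port sidesteps the collision:

```python
# Sketch (assumption, not the actual Puppet code): each version in
# profile::mediawiki::php::php_versions is assigned base_port + index.
def assign_ports(versions, base_port):
    return {v: base_port + i for i, v in enumerate(versions)}

old     = assign_ports(["7.2", "7.4"], 8000)  # 7.2 -> 8000, 7.4 -> 8001
swapped = assign_ports(["7.4", "7.2"], 8000)  # 7.4 wants 8000, still held by 7.2
bumped  = assign_ports(["7.4", "7.2"], 8002)  # 8002/8003: no clash with old ports
```

Since `swapped` reuses the same port set as `old`, the new php7.4-fpm tries to bind a port the old php7.2-fpm still holds; `bumped` uses a disjoint port range, so both daemons can start cleanly during the transition.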

> presumably doesn't have a built-in way to randomly set a header for a % of traffic

Right, I think that would be a new feature. But I talked to @JMeybohm just now, who said that the idea was to set the header for a specific job type, globally. That can be done in cpjobqueue configuration.

Yes, the idea was to have jobqueue send the header so it can be set per job type and would not require reconfiguration of the servers (as the PHP_VERSION cookie selector is already present).

For me, ./make_beta_config.py . jobqueue now generates the config that is running in deployment-prep (which, IIRC, wasn't the case when I tried last time; there was a diff of a couple of hundred lines). So I guess something like the following patch would be enough to render a config that makes jobqueue send the cookie along:

diff --git a/charts/changeprop/templates/_jobqueue.yaml b/charts/changeprop/templates/_jobqueue.yaml
index 4dc34c16..6ac56c5f 100644
--- a/charts/changeprop/templates/_jobqueue.yaml
+++ b/charts/changeprop/templates/_jobqueue.yaml
@@ -143,6 +143,7 @@ spec: &spec
                   method: post
                   uri: '{{ $jobrunner_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -160,6 +161,7 @@ spec: &spec
                   method: 'post'
                   uri: '/sys/partition/{{ $part_options.partitioner_kind }}/'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: "application/json"
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -174,6 +176,7 @@ spec: &spec
                   method: post
                   uri: '{{ $jobrunner_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -196,6 +199,7 @@ spec: &spec
                   method: post
                   uri: '{{ $videoscaler_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -226,6 +230,7 @@ spec: &spec
                   method: post
                   uri: '{{ $jobrunner_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'

Sorry for some rambling; I'm not very familiar with Docker. I'm exploring it, and I figure others might have some useful observations. But if not, there will be Gerrit commits and documentation edits.

On deployment-deploy03 I edited make_beta_config.py to use helm3 instead of helm (which is a symlink to helm2), then I ran it and ssh-piped the result to deployment-docker-cpjobqueue01:/home/tstarling/config.yaml. I ran docker container ls, then docker container inspect on the running container, which showed a bind mount from /var/lib/docker/volumes/cpjobqueue/_data on the host to /etc/cpjobqueue in the container. The generated configuration seems the same as the deployed configuration.

root@deployment-docker-cpjobqueue01:~# md5sum /var/lib/docker/volumes/cpjobqueue/_data/config.yaml ~tstarling/config.yaml
985961fbb17caed6ffc3f64063718852  /var/lib/docker/volumes/cpjobqueue/_data/config.yaml
985961fbb17caed6ffc3f64063718852  /home/tstarling/config.yaml

wt:Changeprop suggests docker run -it -v cpjobqueue:/srv/cpjobqueue alpine /bin/sh then inspecting /srv/cpjobqueue/config.yaml, which produces the same result. The new container has the bind mount /var/lib/docker/volumes/cpjobqueue/_data -> /srv/cpjobqueue so you see the same files that way, but inside the container you don't have the tools you need to update it.

The doc then suggests service cpjobqueue restart. There is /lib/systemd/system/cpjobqueue.service on the host which restarts the docker container.

Another way to get the location of the config file on the host is docker volume inspect cpjobqueue -f '{{.Mountpoint}}'. But it seems like it would be easier to use a fixed bind mount instead of a volume. I found an explanation for this weird situation in puppet/modules/service/manifests/docker.pp; apparently it is "just to bypass a 64KB limitation of horizon... The entire lifecycle management of that docker volume is left up to the user of the volume. So precreate it and prepopulate it please".

I think the volume was not precreated. That seems fixable...

I made /etc/cpjobqueue on the host but was not able to reconfigure the volume to use it. I think I would have to delete the container, which means I would have to know how to recreate it, which I don't.

By the way, I temporarily stopped puppet while investigating this, but systemctl start puppet does not work because /etc/puppet/puppet.conf has daemonize=false which breaks the systemd unit file. So I started puppet manually.

I applied @JMeybohm's patch to my working copy and deployed it to cpjobqueue01. Then I fixed it with s/PHP_VERSION/PHP_ENGINE/ and deployed it again. Then I confirmed with tcpdump that the jobrunner returns X-Powered-By: PHP/7.4.28.

Change 804486 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/deployment-charts@master] make_beta_config.py: run helm as helm3

https://gerrit.wikimedia.org/r/804486

Mentioned in SAL (#wikimedia-releng) [2022-06-10T04:03:34Z] <TimStarling> on deployment-deploy03 running scap sync-world -v with PHP 7.2 for T295578 sanity check

The next checklist item is to test scap with PHP 7.4. I reviewed scap and extracted the following list of scripts that it runs:

  • rebuildLocalisationCache.php
  • mergeMessageFileList.php
  • purgeMessageBlobStore.php
  • extensions/WikimediaMaintenance/refreshMessageBlobs.php
  • eval.php (check_fatals)
  • dumpInterwiki.php
  • php -l

All of these scripts can be executed by following this procedure:

rm /srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-en.cdb
scap sync-world -v
scap sync-world -v -D delay_messageblobstore_purge:false
scap update-interwiki-cache -v --beta

I did that on deployment-deploy03 with PHP 7.2 and it seemed to work. All the executions show up in the debug log except dumpInterwiki.php, so I confirmed that dumpInterwiki.php was being run using strace.

scap has the php_version config var, which is passed through to mwscript as the PHP environment variable, so that should hopefully work for most things, without the need for update-alternatives.
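That passthrough can be sketched as follows. The wrapper logic and the script path here are assumptions for illustration, not scap's or mwscript's actual code; the point is just that the interpreter comes from the PHP environment variable with a site-default fallback.

```python
def resolve_php(environ, default="php7.2"):
    """Pick the PHP interpreter: honour the PHP environment variable
    (set by scap from its php_version config var), else the default."""
    return environ.get("PHP", default)

def build_mwscript_command(script, wiki, environ):
    """Assemble an mwscript-style invocation (path is a placeholder)."""
    return [resolve_php(environ), "MWScript.php", script, "--wiki=" + wiki]
```

With this model, `PHP=php7.4` (or `-D php_version:php7.4`) flips every maintenance-script invocation to 7.4 without touching update-alternatives, which matches the behaviour observed below.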

Mentioned in SAL (#wikimedia-releng) [2022-06-10T04:27:51Z] <TimStarling> on deployment-deploy03 running scap sync-world -v with PHP 7.4 for T295578

I ran

rm /srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-en.cdb
scap sync-world -v -D php_version:php7.4
PHP=php7.4 scap sync-world -v -D delay_messageblobstore_purge:false -D php_version:php7.4
scap update-interwiki-cache -v --beta -D php_version:phpfoo
scap update-interwiki-cache -v --beta -D php_version:php7.4

Everything ran as PHP 7.4 except php -l. The php_version:phpfoo check confirmed that the PHP environment variable was passed through to mwscript without any special handling at the call site.

To test php -l I did:

find -O2 '/srv/mediawiki-staging/wmf-config' '/srv/mediawiki-staging/multiversion' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P2 php7.4 -l

Also, I noticed that scap.cfg has

php_fpm: php7.2-fpm
php_fpm_restart_script: /usr/local/sbin/check-and-restart-php
php_fpm_unsafe_restart_script: /usr/local/sbin/restart-php7.2-fpm
php_fpm_always_restart: True

That will need to be updated. But this test is complete and I'm checking the box.

Change 804486 abandoned by Tim Starling:

[operations/deployment-charts@master] make_beta_config.py: run helm as helm3

Reason:

removed helm2

https://gerrit.wikimedia.org/r/804486

I inspected the logs from the last 5 days of PHP 7.4 job runner execution on beta. The only relevant log entries I could find were three exceptions with the message "Failed to load data blob from Bad data in text row 23625" from dewiki. Using eval.php I confirmed that the exception occurs with both 7.2 and 7.4.

I ran the following maintenance scripts on the beta cluster with PHP 7.4:

  • mwscript maintenance/cleanupUploadStash.php --wiki=enwiki
    • Broken independent of version, 49 files fail with "UploadStashBadPathException"
  • mwscript refreshLinks.php --wiki=enwiki --dfn-only
  • mwscript initSiteStats.php --wiki=enwiki --update
  • mwscript maintenance/getLagTimes.php --wiki=enwiki --report
  • mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php --wiki=enwiki --cutoff 60
  • mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=enwiki --search-index --db-table --dry-run --statsd
  • mwscript extensions/GrowthExperiments/maintenance/listTaskCounts.php --wiki=enwiki --tasktype link-recommendation --topictype ores --statsd --output none
  • mwscript extensions/GrowthExperiments/maintenance/purgeExpiredMentorStatus.php --wiki=enwiki
  • mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php --wiki=enwiki --verbose
    • Long running, eventually started failing with HTTP "429 Too Many Requests"
    • error.log showed "PHP Notice: Trying to access array offset on value of type null" from LinkRecommendationUpdater.php lines 208 and 209.
    • Also "PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource" but this was probably unrelated log traffic, vague backtrace Monolog\Handler\Handler->__destruct, thousands of events per day in archives.
  • mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=enwiki --statsd --dbshard s1
  • mwscript extensions/PageAssessments/maintenance/purgeUnusedProjects.php --wiki=enwiki
  • mwscript extensions/PageTriage/cron/updatePageTriageQueue.php --wiki=enwiki
  • mwscript extensions/AbuseFilter/maintenance/PurgeOldLogIPData.php --wiki=enwiki
  • mwscript extensions/CheckUser/maintenance/purgeOldData.php --wiki=enwiki
  • mwscript maintenance/purgeExpiredBlocks.php --wiki=enwiki
  • mwscript maintenance/purgeExpiredUserrights.php --wiki=enwiki
  • mwscript extensions/SecurePoll/cli/purgePrivateVoteData.php --wiki=enwiki
  • mwscript recountCategories.php --wiki=enwiki --mode all
  • mwscript extensions/WikimediaMaintenance/blameStartupRegistry.php --wiki=enwiki --record-stats
    • "MWException: File '/srv/mediawiki/php-master/extensions//static/images/wikibase/echoIcon.svg' does not exist". Independent of version since many other instances of this exception were logged in prior runs.
    • "PHP Notice: Trying to access array offset on value of type bool" from SkinModule.php lines 705, 706, 707. This does not occur in PHP 7.2.
  • mwscript updateSpecialPages.php --wiki=enwiki --override

In summary, two possible bugs which I am investigating: one in refreshLinkRecommendations.php and one in blameStartupRegistry.php.

I filed the second bug as T310767.

Note that both bugs stem from backwards-incompatible changes in PHP 7.4, in which incorrect code that was previously silent now raises a notice. These bugs demonstrate that we are better off running PHP 7.4 in production so that such errors can be detected.

That concludes the work requested in the task description.

tstarling claimed this task.
tstarling updated the task description.
tstarling added a subscriber: JMeybohm.