
Test running php7.2 and php7.4 in parallel on the beta cluster
Closed, Resolved · Public

Description

We need to test that php 7.2 and php 7.4 can coexist safely in deployment-prep. To this end we need to:

  • Install an appserver with both
  • Install a jobrunner with both
  • Install both on the deployment server. This is a good opportunity to test the stages of scap that need php.
  • Install on the maintenance server, test some periodic scripts
  • Install on a dumps test server, run tests. T295580

For configuration of changeprop, there is https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep but the steps described there don't seem to generate a valid configuration currently.

And we should test basic functionality on all of those.

Related Objects

Status | Subtype | Assigned
Resolved | | None
Resolved | | Jdforrester-WMF
Resolved | | Jdforrester-WMF
Resolved | | Jdforrester-WMF
Resolved | | Jdforrester-WMF
Resolved | | toan
Resolved | | Lucas_Werkmeister_WMDE
Resolved | | Joe
Resolved | | Jdforrester-WMF
Resolved | | Ladsgroup
Invalid | | None
Resolved | | Reedy
Open | | None
Resolved | | tstarling
Resolved | | Jdforrester-WMF
Resolved | PRODUCTION ERROR | Legoktm
Resolved | | tstarling
Resolved | | Joe
Resolved | | Joe
Resolved | | tstarling
Resolved | | ArielGlenn

Event Timeline

I have installed a fresh appserver with both php 7.2 and 7.4, and:

  • Requests get redirected correctly based on the php_engine cookie
  • Monitoring reports data about both engines
  • Responses from 7.4 don't obviously miss details
  • php7.2 is chosen by default, and is the fallback in any case.
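The cookie-based selection described above can be sketched as a tiny decision function. This is illustrative only: the pool names and the dictionary interface are assumptions for the sketch, while the real routing lives in the Apache configuration.

```python
# Sketch of cookie-based engine selection (assumed pool names, not the
# actual Apache config): pick a PHP-FPM backend from the PHP_ENGINE
# cookie, defaulting to 7.2 and falling back to 7.2 on invalid values.
BACKENDS = {"7.2": "php7.2-fpm", "7.4": "php7.4-fpm"}
DEFAULT = "7.2"

def pick_backend(cookies):
    """Return the FPM pool for a request, given its cookies as a dict."""
    engine = cookies.get("PHP_ENGINE", DEFAULT)
    # Unknown or garbled cookie values fall back to the default engine.
    return BACKENDS.get(engine, BACKENDS[DEFAULT])
```

For example, `pick_backend({"PHP_ENGINE": "7.4"})` selects the 7.4 pool, while both a missing cookie and an invalid value such as `"8.0"` land on the 7.2 pool, matching the "default and fallback in any case" behaviour above.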
Joe triaged this task as Medium priority. Jan 5 2022, 9:06 AM

Change 755536 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] deployment-prep: install php 7.4 everywhere

https://gerrit.wikimedia.org/r/755536

Change 755536 merged by JMeybohm:

[operations/puppet@production] deployment-prep: install php 7.4 everywhere

https://gerrit.wikimedia.org/r/755536

Re-assigning to Janis as he's been doing work in this direction over the last week.

Puppet is failing on deployment-parsoid12:

Mar 03 07:07:58 deployment-parsoid12 php[31713]: PHP Warning:  PHP Startup: Unable to load dynamic library 'wddx.so' (tried: /usr/lib/php/20190902/wddx.so (/usr/lib/php/20190902/wddx.so: cannot open shared object file: No such file or directory), /usr/lib/php/20190902/wddx.so.so (/usr/lib/php/20190902/wddx.so.so: cannot open shared object file: No such file or directory)) in Unknown on line 0

Thanks @Majavah !

wddx is no longer bundled with PHP >=7.4 and T295725 suggests we could just stop loading it.

Change 769745 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Stop loading wddx PHP extension with PHP 7.4

https://gerrit.wikimedia.org/r/769745

Change 769745 merged by JMeybohm:

[operations/puppet@production] Stop loading wddx PHP extension with PHP 7.4

https://gerrit.wikimedia.org/r/769745

Manual tests (with and without the cookie PHP_ENGINE=7.4, as well as with invalid values) against the appserver and parsoid seem fine to me so far.

Change 787526 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: use helm3 semantics for generating beta config

https://gerrit.wikimedia.org/r/787526

There were some issues with the make_beta_config.py script under helm3; there's a CR open to fix them. In general, is the plan to create an entirely independent second instance of changeprop, or just to update its configuration?

Change 787526 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: use helm3 semantics for generating beta config

https://gerrit.wikimedia.org/r/787526

I spent a couple of hours familiarising myself with how cpjobqueue is configured and how it works in the beta cluster. I still don't understand why cpjobqueue needs to be part of this project. The task description says "install a jobrunner with both", but that is already done.

A reasonable deployment sequence for job execution might be:

  1. Run one job with PHP 7.4
    • This can be done by sending a fake job with curl, no need for cpjobqueue.
  2. Run some subset of real jobs with PHP 7.4
    • The implementation depends on what subset you want. Several options can be implemented in the Apache config without touching cpjobqueue.
  3. Run all jobs with PHP 7.4
    • This can be done in the Apache config.

I think the only option for step 2 that would need cpjobqueue involvement is if cookies are to be forwarded from a developer's browser to MediaWiki, then to cpjobqueue to be stored in Kafka, then back to MediaWiki again, similar to how request ID forwarding works. This is a complex option requiring changes to 3 or 4 repos.

I think my preferred option for step 2 is a random sample implemented in Apache.

> I think my preferred option for step 2 is a random sample implemented in Apache.

OK, that turned out to be more involved than expected. Apache doesn't have an easy way to get random numbers, and mod_proxy_balancer doesn't work for FastCGI backends. The complexity of the remaining solutions looks similar to doing it in cpjobqueue.

If cpjobqueue just provided a random integer in a header, Apache could take it from there. Or I could do cookie forwarding through cpjobqueue. Having cpjobqueue decide what version to send means we have to reconfigure it to turn the experiment on and off, which I want to avoid since the difficulty of reconfiguring this service is evidently so great that it derailed the whole project for months.
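If cpjobqueue did supply such a random integer in a (hypothetical) header, the Apache-side decision would amount to a simple threshold test. A minimal sketch, with the header name and per-mille granularity as assumptions:

```python
def select_version(random_header, sample_permille, pilot="7.4", default="7.2"):
    """Decide which PHP version serves a job, given a random integer in
    the range 0-999 supplied upstream in a hypothetical header value.
    sample_permille is the share of traffic (per thousand) sent to the
    pilot version; missing or garbled headers stay on the default."""
    try:
        r = int(random_header)
    except (TypeError, ValueError):
        return default
    return pilot if 0 <= r < sample_permille else default
```

Turning the experiment on and off, or ramping it up, then only means changing `sample_permille` on the Apache side, with no cpjobqueue reconfiguration.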

> Having cpjobqueue decide what version to send means we have to reconfigure it to turn the experiment on and off, which I want to avoid since the difficulty of reconfiguring this service is evidently so great that it derailed the whole project for months.

cpjobqueue has been reconfigured in deployment-prep multiple times recently; it's documented and takes a matter of minutes to do.

I don't think we need per-jobtype control here. From my perspective, the important points here are:

  1. we can switch PHP versions by cookie for appservers in production, and we might as well test this out in beta.
  2. we can control the PHP version used for a given jobrunner in production in an easy way, e.g. via a Hiera boolean variable, to let SRE ramp it up progressively; and again we might as well use the same mechanism to flip the one jobrunner in Beta to php7.4.
  3. we install both PHP versions on all jobrunners, so that it's quick to switch and to keep them as similar to appservers as possible.

If we approach this minimally with just that, then we don't need to insert any cookies or headers from cpjobqueue. I don't oppose having cpjobqueue insert a cookie header on all RunJob requests if that's a trivial config change, but it seems like it would require extra work: cpjobqueue targets the loadbalancer, not individual hosts, and presumably doesn't have a built-in way to randomly set a header for a % of traffic. So, on the assumption that it isn't a trivial config change, my suggestion would be to approach it only from the Apache side.

In hiera there is profile::mediawiki::php::php_versions, which is an array of PHP versions like [ "7.2", "7.4" ]. The default is the first element in the array. So we can test job execution on 7.4 by swapping the order of the versions for a specific server.

Let's assume there will be no cpjobqueue patch, and production pilot deployment will be done by switching to PHP 7.4 on a particular jobrunner using hiera.

I tested swapping the versions in horizon hiera. This caused puppet to break PHP-FPM by swapping the port numbers -- php7.4-fpm failed to start because port 8000 was still in use by php7.2-fpm.

A workaround for this is to just increment the port number each time the version set is changed.

profile::mediawiki::php::php_versions:
- '7.4'
- '7.2'
profile::php_fpm::fcgi_port: 8002

I tested this on deployment-jobrunner04 and it worked fine. It avoids the need to dive into the puppet classes and change the way we assign ports.
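A minimal model of the assumed port-assignment scheme (base port plus position in the version array; the actual Puppet logic may differ) shows why swapping the list collides with the still-running daemons, and why bumping the base port sidesteps the collision:

```python
# Sketch (assumption, not the actual Puppet code): each version in
# profile::mediawiki::php::php_versions is assigned base_port + index.
def assign_ports(versions, base_port):
    return {v: base_port + i for i, v in enumerate(versions)}

old     = assign_ports(["7.2", "7.4"], 8000)  # 7.2 -> 8000, 7.4 -> 8001
swapped = assign_ports(["7.4", "7.2"], 8000)  # 7.4 wants 8000, still held by 7.2
bumped  = assign_ports(["7.4", "7.2"], 8002)  # 8002/8003: no clash with old ports
```

Since `swapped` reuses the same port set as `old`, the new php7.4-fpm tries to bind a port the old php7.2-fpm still holds; `bumped` uses a disjoint port range, so both daemons can start cleanly during the transition.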

> presumably doesn't have a built-in way to randomly set a header for a % of traffic

Right, I think that would be a new feature. But I talked to @JMeybohm just now, who said that the idea was to set the header for a specific job type, globally. That can be done in cpjobqueue configuration.

Yes, the idea was to have jobqueue send the header so it can be set per job type and would not require reconfiguration of the servers (as the PHP_VERSION cookie selector is already present).

For me, ./make_beta_config.py . jobqueue now generates the config that is running in deployment-prep (which, IIRC, wasn't the case when I tried last time; there was a diff of a couple of hundred lines). So I guess something like the following patch would be enough to render a config that makes jobqueue send the cookie along:

diff --git a/charts/changeprop/templates/_jobqueue.yaml b/charts/changeprop/templates/_jobqueue.yaml
index 4dc34c16..6ac56c5f 100644
--- a/charts/changeprop/templates/_jobqueue.yaml
+++ b/charts/changeprop/templates/_jobqueue.yaml
@@ -143,6 +143,7 @@ spec: &spec
                   method: post
                   uri: '{{ $jobrunner_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -160,6 +161,7 @@ spec: &spec
                   method: 'post'
                   uri: '/sys/partition/{{ $part_options.partitioner_kind }}/'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: "application/json"
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -174,6 +176,7 @@ spec: &spec
                   method: post
                   uri: '{{ $jobrunner_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -196,6 +199,7 @@ spec: &spec
                   method: post
                   uri: '{{ $videoscaler_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'
@@ -226,6 +230,7 @@ spec: &spec
                   method: post
                   uri: '{{ $jobrunner_uri }}'
                   headers:
+                    cookie: 'PHP_VERSION=7.4'
                     content-type: 'application/json'
                     x-request-id: '{{ `{{globals.message.meta.request_id}}` }}'
                   body: '{{ `{{globals.message}}` }}'

Sorry for some rambling; I'm not very familiar with Docker. I'm exploring it, and I figure others might have some useful observations. But if not, there will be Gerrit commits and documentation edits.

On deployment-deploy03 I edited make_beta_config.py to use helm3 instead of helm (which is a symlink to helm2), then I ran it and ssh-piped the result to deployment-docker-cpjobqueue01:/home/tstarling/config.yaml. I ran docker container ls, then docker container inspect on the running container, which showed a bind mount from /var/lib/docker/volumes/cpjobqueue/_data on the host to /etc/cpjobqueue in the container. The generated configuration seems the same as the deployed configuration.

root@deployment-docker-cpjobqueue01:~# md5sum /var/lib/docker/volumes/cpjobqueue/_data/config.yaml ~tstarling/config.yaml
985961fbb17caed6ffc3f64063718852  /var/lib/docker/volumes/cpjobqueue/_data/config.yaml
985961fbb17caed6ffc3f64063718852  /home/tstarling/config.yaml

wt:Changeprop suggests docker run -it -v cpjobqueue:/srv/cpjobqueue alpine /bin/sh then inspecting /srv/cpjobqueue/config.yaml, which produces the same result. The new container has the bind mount /var/lib/docker/volumes/cpjobqueue/_data -> /srv/cpjobqueue so you see the same files that way, but inside the container you don't have the tools you need to update it.

The doc then suggests service cpjobqueue restart. There is /lib/systemd/system/cpjobqueue.service on the host which restarts the docker container.

Another way to get the location of the config file on the host is docker volume inspect cpjobqueue -f '{{.Mountpoint}}'. But it seems like it would be easier to use a fixed bind mount instead of a volume. I found an explanation for this weird situation in puppet/modules/service/manifests/docker.pp; apparently it is "just to bypass a 64KB limitation of horizon... The entire lifecycle management of that docker volume is left up to the user of the volume. So precreate it and prepopulate it please".

I think the volume was not precreated. That seems fixable...

I made /etc/cpjobqueue on the host but was not able to reconfigure the volume to use it. I think I would have to delete the container, which means I would have to know how to recreate it, which I don't.

By the way, I temporarily stopped puppet while investigating this, but systemctl start puppet does not work because /etc/puppet/puppet.conf has daemonize=false which breaks the systemd unit file. So I started puppet manually.

I applied @JMeybohm's patch to my working copy and deployed it to cpjobqueue01. Then I fixed it with s/PHP_VERSION/PHP_ENGINE/ and deployed it again. Then I confirmed with tcpdump that the jobrunner returns X-Powered-By: PHP/7.4.28.

Change 804486 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/deployment-charts@master] make_beta_config.py: run helm as helm3

https://gerrit.wikimedia.org/r/804486

Mentioned in SAL (#wikimedia-releng) [2022-06-10T04:03:34Z] <TimStarling> on deployment-deploy03 running scap sync-world -v with PHP 7.2 for T295578 sanity check

The next checklist item is to test scap with PHP 7.4. I reviewed scap and extracted the following list of scripts that it runs:

  • rebuildLocalisationCache.php
  • mergeMessageFileList.php
  • purgeMessageBlobStore.php
  • extensions/WikimediaMaintenance/refreshMessageBlobs.php
  • eval.php (check_fatals)
  • dumpInterwiki.php
  • php -l

All of these scripts can be executed by following this procedure:

rm /srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-en.cdb
scap sync-world -v
scap sync-world -v -D delay_messageblobstore_purge:false
scap update-interwiki-cache -v --beta

I did that on deployment-deploy03 with PHP 7.2 and it seemed to work. All the executions show up in the debug log except dumpInterwiki.php, so I confirmed that dumpInterwiki.php was being run using strace.

scap has the php_version config var, which is passed through to mwscript as the PHP environment variable, so that should hopefully work for most things, without the need for update-alternatives.
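That passthrough can be sketched as follows. The wrapper logic and the script path here are assumptions for illustration, not scap's or mwscript's actual code; the point is just that the interpreter comes from the PHP environment variable with a site-default fallback.

```python
def resolve_php(environ, default="php7.2"):
    """Pick the PHP interpreter: honour the PHP environment variable
    (set by scap from its php_version config var), else the default."""
    return environ.get("PHP", default)

def build_mwscript_command(script, wiki, environ):
    """Assemble an mwscript-style invocation (path is a placeholder)."""
    return [resolve_php(environ), "MWScript.php", script, "--wiki=" + wiki]
```

With this model, `PHP=php7.4` (or `-D php_version:php7.4`) flips every maintenance-script invocation to 7.4 without touching update-alternatives, which matches the behaviour observed below.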

Mentioned in SAL (#wikimedia-releng) [2022-06-10T04:27:51Z] <TimStarling> on deployment-deploy03 running scap sync-world -v with PHP 7.4 for T295578

I ran

rm /srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-en.cdb
scap sync-world -v -D php_version:php7.4
PHP=php7.4 scap sync-world -v -D delay_messageblobstore_purge:false -D php_version:php7.4
scap update-interwiki-cache -v --beta -D php_version:phpfoo
scap update-interwiki-cache -v --beta -D php_version:php7.4

Everything ran as PHP 7.4 except php -l. The php_version:phpfoo check confirmed that the PHP environment variable was passed through to mwscript without any special handling at the call site.

To test php -l I did:

find -O2 '/srv/mediawiki-staging/wmf-config' '/srv/mediawiki-staging/multiversion' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P2 php7.4 -l

Also, I noticed that scap.cfg has

php_fpm: php7.2-fpm
php_fpm_restart_script: /usr/local/sbin/check-and-restart-php
php_fpm_unsafe_restart_script: /usr/local/sbin/restart-php7.2-fpm
php_fpm_always_restart: True

That will need to be updated. But this test is complete and I'm checking the box.

Change 804486 abandoned by Tim Starling:

[operations/deployment-charts@master] make_beta_config.py: run helm as helm3

Reason:

removed helm2

https://gerrit.wikimedia.org/r/804486

I inspected the logs from the last 5 days of PHP 7.4 job runner execution on beta. The only relevant log entries I could find were three exceptions with the message "Failed to load data blob from Bad data in text row 23625" from dewiki. Using eval.php I confirmed that the exception occurs with both 7.2 and 7.4.

I ran the following maintenance scripts on the beta cluster with PHP 7.4:

  • mwscript maintenance/cleanupUploadStash.php --wiki=enwiki
    • Broken independent of version, 49 files fail with "UploadStashBadPathException"
  • mwscript refreshLinks.php --wiki=enwiki --dfn-only
  • mwscript initSiteStats.php --wiki=enwiki --update
  • mwscript maintenance/getLagTimes.php --wiki=enwiki --report
  • mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php --wiki=enwiki --cutoff 60
  • mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=enwiki --search-index --db-table --dry-run --statsd
  • mwscript extensions/GrowthExperiments/maintenance/listTaskCounts.php --wiki=enwiki --tasktype link-recommendation --topictype ores --statsd --output none
  • mwscript extensions/GrowthExperiments/maintenance/purgeExpiredMentorStatus.php --wiki=enwiki
  • mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php --wiki=enwiki --verbose
    • Long running, eventually started failing with HTTP "429 Too Many Requests"
    • error.log showed "PHP Notice: Trying to access array offset on value of type null" from LinkRecommendationUpdater.php lines 208 and 209.
    • Also "PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource" but this was probably unrelated log traffic, vague backtrace Monolog\Handler\Handler->__destruct, thousands of events per day in archives.
  • mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=enwiki --statsd --dbshard s1
  • mwscript extensions/PageAssessments/maintenance/purgeUnusedProjects.php --wiki=enwiki
  • mwscript extensions/PageTriage/cron/updatePageTriageQueue.php --wiki=enwiki
  • mwscript extensions/AbuseFilter/maintenance/PurgeOldLogIPData.php --wiki=enwiki
  • mwscript extensions/CheckUser/maintenance/purgeOldData.php --wiki=enwiki
  • mwscript maintenance/purgeExpiredBlocks.php --wiki=enwiki
  • mwscript maintenance/purgeExpiredUserrights.php --wiki=enwiki
  • mwscript extensions/SecurePoll/cli/purgePrivateVoteData.php --wiki=enwiki
  • mwscript recountCategories.php --wiki=enwiki --mode all
  • mwscript extensions/WikimediaMaintenance/blameStartupRegistry.php --wiki=enwiki --record-stats
    • "MWException: File '/srv/mediawiki/php-master/extensions//static/images/wikibase/echoIcon.svg' does not exist". Independent of version since many other instances of this exception were logged in prior runs.
    • "PHP Notice: Trying to access array offset on value of type bool" from SkinModule.php lines 705, 706, 707. This does not occur in PHP 7.2.
  • mwscript updateSpecialPages.php --wiki=enwiki --override

In summary, two possible bugs which I am investigating: one in refreshLinkRecommendations.php and one in blameStartupRegistry.php.

I filed the second bug as T310767.

Note that both bugs stem from backwards-incompatible changes in PHP 7.4, in which incorrect code that was previously silent now raises a notice. These bugs demonstrate that we are better off running PHP 7.4 in production so that such errors can be detected.

That concludes the work requested in the task description.

tstarling claimed this task.
tstarling updated the task description.
tstarling added a subscriber: JMeybohm.