
Upgrade MediaWiki clusters to Debian Buster (Debian 10)
Closed, Resolved (Public)

Description

Our four MediaWiki clusters (application, API, jobrunners/videoscalers, and parsoid) need to be migrated to Debian Buster.

Progress: see https://docs.google.com/spreadsheets/d/1Ris18-joRFfd3OHjGJIraVUk-bpmIRORsPoms9D7BcM/edit?usp=sharing

Provisional plan for the migration:

  • Upgrade all current stretch servers to ICU 63 T264991
  • Rebuild all our php-7.2 packages for Debian Buster (buster-wikimedia); a rebuild sketch follows this list
    • php7.2-cli
    • php7.2-common
    • php7.2-curl
    • php7.2-dba
    • php7.2-fpm
    • php7.2-gd
    • php7.2-gmp
    • php7.2-mysql
    • php7.2-opcache
    • php7.2-phpdbg
    • php7.2-readline
    • php7.2-xml
  • Build missing packages for Buster
    • ploticus
    • prometheus-nutcracker-exporter
    • prometheus-php-fpm-exporter
  • Fix puppet code to support Buster
    • ttf-alee replaced with fonts-alee
    • ttf-wqy-zenhei replaced with fonts-wqy-zenhei
    • code to add PHP72 component on buster
  • Reimage mwdebug1001 to buster OR introduce mwdebug1003, so as not to interfere with development testing
    • first iteration done with testvm1001, decom'ed again
    • mwdebug1003 to be introduced early December (T267248)
    • add PHP72 APT component on mwdebug1003
  • Reimage parse2001 to buster (parsoid)
  • Reimage mw2243 to buster (jobrunner)
  • Reimage mw1265 to Buster (weight=5)
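
For the php7.2 rebuilds above, the usual pattern is to rebuild the existing source package unchanged in a clean Buster environment, with only a changelog bump so the resulting binaries are versioned for the buster-wikimedia component. A minimal sketch, assuming an sbuild setup with a Buster chroot (the version suffix, chroot name, and exact build flow are illustrative assumptions, not the commands actually used):

  # fetch the current php7.2 source package
  apt-get source php7.2
  cd php7.2-*

  # bump the changelog for the new distro; the +deb10u suffix is an assumption
  dch --local +deb10u --distribution buster-wikimedia --force-distribution \
      "Rebuild for Debian Buster (buster-wikimedia)"

  # rebuild in a clean Buster chroot (chroot name is an assumption)
  sbuild --dist=buster-wikimedia --chroot=buster-amd64-sbuild

The same approach covers the missing packages (ploticus and the two prometheus exporters): rebuild their existing sources against Buster and publish them to the buster-wikimedia component.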

Q3

Details

Repo                                          Branch      Lines +/-
operations/dns                                master      +2 -2
operations/puppet                             production  +0 -1
operations/puppet                             production  +0 -1
operations/puppet                             production  +0 -43
operations/puppet                             production  +3 -4
operations/puppet                             production  +1 -1
operations/puppet                             production  +1 -1
operations/puppet                             production  +1 -1
operations/puppet                             production  +1 -1
operations/puppet                             production  +1 -1
operations/puppet                             production  +1 -1
operations/puppet                             production  +0 -24
operations/puppet                             production  +0 -121
operations/puppet                             production  +0 -150
operations/puppet                             production  +0 -4
operations/puppet                             production  +9 -2
operations/puppet                             production  +7 -1
operations/puppet                             production  +0 -2
operations/puppet                             production  +0 -1
operations/puppet                             production  +2 -0
operations/debs/prometheus-php-fpm-exporter   master      +14 -2
operations/puppet                             production  +21 -28
operations/puppet                             production  +5 -1
operations/puppet                             production  +2 -2
operations/puppet                             production  +8 -1
operations/puppet                             production  +2 -1
operations/puppet                             production  +5 -0
operations/puppet                             production  +0 -8

Related Objects

Status    Subtype           Assigned
Resolved                    None
Resolved                    Jdforrester-WMF
Resolved                    Jdforrester-WMF
Resolved                    Jdforrester-WMF
Resolved                    Jdforrester-WMF
Resolved                    toan
Resolved                    Lucas_Werkmeister_WMDE
Resolved                    Joe
Resolved                    Jdforrester-WMF
Resolved                    Ladsgroup
Invalid                     None
Resolved                    Reedy
Open                        None
Resolved                    tstarling
Resolved                    Jdforrester-WMF
Stalled                     None
Resolved                    None
Resolved  PRODUCTION ERROR  Legoktm
Resolved                    tstarling
Resolved                    Joe
Resolved                    Krinkle
Resolved                    hashar
Resolved                    Jdforrester-WMF
Resolved                    Dzahn
Resolved                    hashar
Resolved                    Jdforrester-WMF
Resolved                    Ladsgroup
Resolved                    MoritzMuehlenhoff
Resolved                    jijiki
Resolved                    MoritzMuehlenhoff
Resolved                    Trizek-WMF
Resolved                    Dzahn
Resolved                    Gilles
Resolved                    Dzahn
Resolved  Request           Papaul
Resolved                    jijiki
Declined                    None
Resolved                    Dzahn
Resolved                    Dzahn
Resolved                    Papaul
Resolved                    Cmjohnson
Resolved  Request           Cmjohnson
Resolved  Request           Papaul
Resolved                    Andrew
Resolved                    ArielGlenn
Resolved                    Dzahn
Resolved                    Legoktm
Resolved                    Papaul
Resolved                    Dzahn
Declined                    Gilles
Resolved                    Volans
Resolved                    Dzahn
Resolved                    Legoktm

Event Timeline


Completed auto-reimage of hosts:

['mw1317.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1316.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102221850_dzahn_17022_mw1316_eqiad_wmnet.log.
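
(These launch/completion messages are logged automatically by the reimage tooling. For context, a launch like the one above corresponds to an invocation on the cumin host roughly along these lines; the exact flags are an assumption based on typical usage, and <task-id> is a placeholder:)

  # run as root on cumin1001.eqiad.wmnet
  sudo -i wmf-auto-reimage -p <task-id> mw1316.eqiad.wmnet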

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1315.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102221855_dzahn_21586_mw1315_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1349.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102221947_dzahn_7004_mw1349_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1316.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1315.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1314.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222015_dzahn_4645_mw1314_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1312.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222019_dzahn_8398_mw1312_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1349.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1279.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222029_dzahn_17792_mw1279_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1314.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1312.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1279.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1286.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222221_dzahn_30877_mw1286_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1410.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222223_dzahn_620_mw1410_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1412.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222224_dzahn_1715_mw1412_eqiad_wmnet.log.

@MoritzMuehlenhoff do you think it makes sense to keep 1 api and 1 app server on stretch a bit longer, so as to keep comparing performance? IIRC there might be some upcoming perf optimisations on the MediaWiki side.

Completed auto-reimage of hosts:

['mw1410.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1412.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1286.eqiad.wmnet']

and were ALL successful.

Change 635108 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001

https://gerrit.wikimedia.org/r/635108

Change 635108 merged by Dzahn:
[operations/puppet@production] tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001

https://gerrit.wikimedia.org/r/635108

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2001.codfw.wmnet', 'parse2002.codfw.wmnet', 'parse2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103291305_jiji_19021.log.

Change 675506 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):
[operations/puppet@production] install_server: switch parsoid servers to buster

https://gerrit.wikimedia.org/r/675506

Change 675506 merged by Effie Mouzeli:
[operations/puppet@production] install_server: switch parsoid servers to buster

https://gerrit.wikimedia.org/r/675506

I have reimaged parse2001 as a test, and it appears that puppet is unable to run successfully because:

Error: Execution of '/usr/bin/scap deploy-local --repo parsoid/deploy -D log_json:False' returned 70: 15:19:26 Fetch from: http://deploy1001.eqiad.wmnet/parsoid/deploy/.git
15:19:26 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 347, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 147, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 291, in fetch
    git.fetch(self.context.cache_dir, git_remote)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 374, in fetch
    git.clone(*cmd)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 1428, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 775, in __init__
    self.wait()
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 793, in wait
    self.handle_command_exit_code(exit_code)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 816, in handle_command_exit_code
    raise exc
ErrorReturnCode_128:

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet

15:19:26 deploy-local failed: <ErrorReturnCode_128>

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet


Error: /Stage[main]/Parsoid/Service::Node[parsoid]/Scap::Target[parsoid/deploy]/Package[parsoid/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo parsoid/deploy -D log_json:False' returned 70: 15:19:26 Fetch from: http://deploy1001.eqiad.wmnet/parsoid/deploy/.git
15:19:26 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 347, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 147, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 291, in fetch
    git.fetch(self.context.cache_dir, git_remote)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 374, in fetch
    git.clone(*cmd)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 1428, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 775, in __init__
    self.wait()
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 793, in wait
    self.handle_command_exit_code(exit_code)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 816, in handle_command_exit_code
    raise exc
ErrorReturnCode_128:

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet

15:19:26 deploy-local failed: <ErrorReturnCode_128>

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet


Notice: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/File[/lib/systemd/system/parsoid.service]: Dependency Package[parsoid/deploy] has failures: true
Warning: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/File[/lib/systemd/system/parsoid.service]: Skipping because of failed dependencies
Warning: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/Exec[systemd reload for parsoid]: Skipping because of failed dependencies
Warning: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/Service[parsoid]: Skipping because of failed dependencies
Notice: Applied catalog in 22.96 seconds

> I have reimaged parse2001 as a test, and it appears that puppet is unable to run successfully because:
>
> Error: Execution of '/usr/bin/scap deploy-local --repo parsoid/deploy -D log_json:False' returned 70: 15:19:26 Fetch from: http://deploy1001.eqiad.wmnet/parsoid/deploy/.git

@jijiki This is where deploy1001 appears:

deployment/parsoid/deploy-cache/.config:git_server: deploy1001.eqiad.wmnet

editing that file should fix it.

Other options from the past appear to include: "run scap with --refresh-config, delete cached .config file".

For more background also see T197470, T197470#4414254, T162814, T196663#4265139, afaict.
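
A hedged sketch of how the stale reference could be located and fixed on an affected host, using the path from the grep output above (the sed edit, and deploy1002 as the replacement host per the tcpircbot patch earlier in this task, are assumptions):

  # find cached scap configs still pointing at the old deploy server
  grep -r 'git_server: deploy1001.eqiad.wmnet' /srv/deployment/

  # point the cached config at the current deploy server
  sed -i 's/deploy1001\.eqiad\.wmnet/deploy1002.eqiad.wmnet/' \
      /srv/deployment/parsoid/deploy-cache/.config

Alternatively, per the options above, delete the cached .config and re-run scap deploy-local with --refresh-config so the file is regenerated from the deployment server.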

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104081320_jiji_21421.log.

Completed auto-reimage of hosts:

['parse2001.codfw.wmnet']

and were ALL successful.

Change 680483 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: switch mw1307 to use buster installer

https://gerrit.wikimedia.org/r/680483

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1402.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104162338_dzahn_27978_mw1402_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1403.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104162338_dzahn_28020_mw1403_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1307.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104162340_dzahn_28210_mw1307_eqiad_wmnet.log.

Change 680483 merged by Dzahn:

[operations/puppet@production] DHCP: switch mw1307 to use buster installer

https://gerrit.wikimedia.org/r/680483

The remaining 3 special cases kept on stretch have now been reimaged to buster as well.

Decom'ed mwdebug1003 VM.

Everything here is completely done now... except mwmaint1002, which will happen during the DC switchover.

Completed auto-reimage of hosts:

['mw1403.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1402.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1307.eqiad.wmnet']

and were ALL successful.

Dzahn changed the task status from Open to Stalled. Aug 5 2021, 1:37 PM

This is only open due to a single remaining server, the mwmaint server in codfw. It will be upgraded after we switch DCs back on September 13th.

Change 721358 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: switch mwmaint2002 from stretch to buster installer

https://gerrit.wikimedia.org/r/721358

Change 721546 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch mwmaint.discovery.wmnet from codfw to eqiad

https://gerrit.wikimedia.org/r/721546

Change 721546 merged by Dzahn:

[operations/dns@master] switch mwmaint.discovery.wmnet from codfw to eqiad

https://gerrit.wikimedia.org/r/721546

Change 721358 merged by Dzahn:

[operations/puppet@production] DHCP: switch mwmaint2002 from stretch to buster installer

https://gerrit.wikimedia.org/r/721358

Dzahn changed the task status from Stalled to In Progress. Sep 16 2021, 2:38 PM
Dzahn changed the status of subtask T267607: upgrade mwmaint servers to buster from Stalled to In Progress.

https://noc.wikimedia.org (mwmaint.discovery.wmnet) has been switched from codfw to eqiad.
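
(A quick way to verify such a discovery switch is to resolve the record and confirm it now returns the eqiad host; the check below is plain dig usage, with mwmaint1002 assumed as the eqiad endpoint:)

  # should now resolve to the eqiad mwmaint address rather than codfw
  dig +short mwmaint.discovery.wmnet
  dig +short mwmaint1002.eqiad.wmnet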

mwmaint2002 has been upgraded to buster. Monitoring is all green.

This was the last open checkbox and completes the task.

fgiunchedi subscribed.

I don't think this is resolved; see T275752 for jobrunner slowness on buster in upload.