Beta cron jobs seem broken
Closed, Resolved (Public)

Description

Checking any of the jobs on deployment-mwmaint01, I get errors like

 sudo systemctl status mediawiki_job_readinglists_purge.service
● mediawiki_job_readinglists_purge.service - MediaWiki periodic job readinglists
   Loaded: loaded (/lib/systemd/system/mediawiki_job_readinglists_purge.service;
   Active: failed (Result: exit-code) since Thu 2021-03-11 02:42:01 UTC; 15h ago
  Process: 14022 ExecStart=/usr/local/bin/mw-cli-wrapper /usr/local/bin/mwscript
 Main PID: 14022 (code=exited, status=2)

This seems to be due to /usr/local/bin/mw-cli-wrapper being broken: it refers to /etc/conftool-state/mediawiki.yaml but that file doesn't exist.
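
A quick way to confirm this diagnosis on a host is to test whether the state file is readable. This is only a minimal sketch: `state_file_status` is a hypothetical helper name, and the only path taken from this task is /etc/conftool-state/mediawiki.yaml.

```shell
#!/bin/sh
# Hypothetical helper (not part of mw-cli-wrapper): report whether the
# state file that the wrapper reads is present and readable.
state_file_status() {
    if [ -r "$1" ]; then
        echo "present"
    else
        echo "missing"
    fi
}
```

Calling `state_file_status /etc/conftool-state/mediawiki.yaml` on deployment-mwmaint01 should print "missing" while the problem described above persists.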

Event Timeline

I have seen the same status=2 exit last week, although I didn't investigate then, so this is probably not new.

Taking a look, this is suspiciously timed with my work on T276462: Replace deployment-etcd-01 with a Buster host.

● fetch_dbconfig.service                                                    loaded failed failed    Fetch the dbconfig from etcd and store it locally

The presence of said file is controlled by the hiera key profile::conftool::state::ensure, which has been set to false for a long time. Enabling it doesn't seem like a good idea either, since it wouldn't find the correct etcd server. I'm not exactly sure what to do here.

On a side note, I also have no idea how this ever worked before; I didn't find anything that suggests the etcd server switch was causing this.

For the time being, I'll just wing it:

# hand-created file. See https://phabricator.wikimedia.org/T277206
# the master datacenter for mediawiki
primary_dc: labs
# read-only settings
read_only:
  labs: true

That seems enough to get the jobs running.

read_only should be false, but whatever, mw-cli-wrapper only checks for the primary_dc line.
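
To illustrate why only the primary_dc line matters, here is a small sketch that runs the wrapper's awk expression against a copy of the hand-created file. `primary_dc_of` is a hypothetical helper name and the temporary file stands in for /etc/conftool-state/mediawiki.yaml; the awk expression itself is the one mw-cli-wrapper uses.

```shell
#!/bin/sh
# Hypothetical helper: extract primary_dc with the same awk expression
# that mw-cli-wrapper uses. The read_only block is never looked at.
primary_dc_of() {
    awk '/primary_dc/ { print $2 }' "$1"
}

# Recreate the hand-written state file in a temporary location.
state_file=$(mktemp)
cat > "$state_file" <<'EOF'
# hand-created file. See https://phabricator.wikimedia.org/T277206
# the master datacenter for mediawiki
primary_dc: labs
# read-only settings
read_only:
  labs: true
EOF

primary_dc_of "$state_file"   # prints: labs
rm -f "$state_file"
```

Note that the pattern is unanchored, so any line containing the string primary_dc would match, including a comment line that happens to mention it.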

The file disappeared after two hours or so.

Puppet likely has something that removes it when conftool isn't being used. I guess it could be hacked out, but I filed a task for the proper long-term solution, T278007: Configure etcd/confd/conftool in beta/deployment-prep like production.

One month later, is this still an ongoing issue / existing problem?

Yes, this is still a problem that needs to be fixed.

I added this to the file (locally on mwmaint01):

urbanecm@deployment-mwmaint01:~$ cat /etc/conftool-state/mediawiki.yaml
primary_dc: labs
urbanecm@deployment-mwmaint01:~$

The CLI wrapper and the timer now seem to work, at least to some extent.

@Majavah could you help me find where this file is constructed in prod, and make the Puppet rule work for beta too?

Puppet deletes that file, since Conftool is not supposed to be there. T278007: Configure etcd/confd/conftool in beta/deployment-prep like production would be the only proper fix, but since that isn't realistically happening now and people need these cronjobs, I've applied a very ugly hack for now:

taavi@deployment-puppetmaster04:/var/lib/git/operations/puppet/modules/profile/files/mediawiki/maintenance$ git show HEAD
commit 7c497a92c22b0c001b3e26781f003f03376acf76 (HEAD)
Author: Taavi Väänänen <hi@taavi.wtf>
Date:   Mon Apr 19 17:47:44 2021 +0000

    [LOCAL HACK] Hack mw-cli-wrapper.sh to work without conftool

    I don't like this, but broken things aren't exactly fun either.

    Bug: T277206
    Bug: T278007

diff --git a/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.sh b/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.sh
index 44b73bd0e9..2d284be102 100755
--- a/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.sh
+++ b/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.sh
@@ -2,9 +2,10 @@
 set -e
 CONFD_FILE='/etc/conftool-state/mediawiki.yaml'
 # First check if the confd file is stale or not. If it is, just exit
-/usr/local/lib/nagios/plugins/check_confd_template "$CONFD_FILE" > /dev/null
-master_dc=$(awk '/primary_dc/ { print $2 }' "$CONFD_FILE")
+#/usr/local/lib/nagios/plugins/check_confd_template "$CONFD_FILE" > /dev/null
+#master_dc=$(awk '/primary_dc/ { print $2 }' "$CONFD_FILE")
 my_dc=$(cat /etc/wikimedia-cluster)
+master_dc=$my_dc
 if [[ "$master_dc" = "$my_dc" ]];
 then
     exec "$@"
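
For comparison, a less invasive variant of the primary_dc lookup could fall back to the local cluster only when the state file is absent. This is a hypothetical sketch, not the hack that was applied: `resolve_master_dc` is an invented name, and it deliberately leaves out the check_confd_template staleness check, which would still need separate handling.

```shell
#!/bin/sh
# Hypothetical variant, NOT the deployed hack: read primary_dc from the
# state file when it is readable, otherwise treat the local cluster
# (second argument) as the primary one.
resolve_master_dc() {
    if [ -r "$1" ]; then
        awk '/primary_dc/ { print $2 }' "$1"
    else
        echo "$2"
    fi
}
```

In the wrapper this would be used as `master_dc=$(resolve_master_dc "$CONFD_FILE" "$my_dc")`, so hosts that do have the file keep reading the value from it.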

Mentioned in SAL (#wikimedia-releng) [2021-04-19T17:58:19Z] <Majavah> apply hack (https://phabricator.wikimedia.org/T277206#7015609) to deployment-puppetmaster04 to unbreak maintenance scripts until we have conftool

Thanks @Majavah! I can confirm that the maintenance scripts are running now.

Urbanecm assigned this task to taavi.

This is resolved by a hack. Removing the hack is out of scope for this task.