urbanecmbot's continuous jobs are getting restarted too frequently
Open, Needs Triage, Public

Description

For the past few days, I have been receiving mails from cron with the following content:

2021-10-23T23:00:14+00:00 Restarting job 'patrolTrusted' ('/data/project/urbanecmbot/11bots/cswiki/userbots/patrolTrusted/patrolTrusted.sh')
Your job 2061435 ("patrolTrusted") has been submitted

or

2021-10-25T06:25:04+00:00 Restarting job 'patrolAfterPatrol' ('/data/project/urbanecmbot/11bots/cswiki/userbots/patrolTrusted/patrolEditsAfterArticle.sh')
Your job 2144807 ("patrolAfterPatrol") has been submitted

or

2021-10-24T10:00:04+00:00 Restarting job 'patrolSandbox' ('/data/project/urbanecmbot/11bots/cswiki/userbots/patrolTrusted/patrolSandbox.sh')
Your job 2090725 ("patrolSandbox") has been submitted

likely from the bigbrother replacement script.

The job output files don't give much indication of what is happening.
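To put a number on "too frequently", the restart lines from the cron mails can be tallied per job. The following is only a rough sketch: it assumes the mail bodies have been saved as plain text lines in exactly the format quoted above, and the names `RESTART_RE` and `parse_restarts` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line format (copied from the cron mails quoted above):
#   <ISO timestamp> Restarting job '<name>' ('<script path>')
RESTART_RE = re.compile(r"^(\S+) Restarting job '([^']+)' \('([^']+)'\)")


def parse_restarts(lines):
    """Return (timestamp, job name, script path) for every restart line."""
    events = []
    for line in lines:
        match = RESTART_RE.match(line.strip())
        if match:
            timestamp = datetime.fromisoformat(match.group(1))
            events.append((timestamp, match.group(2), match.group(3)))
    return events


# Sample input: the three restart lines quoted in the description.
sample_mail_lines = [
    "2021-10-23T23:00:14+00:00 Restarting job 'patrolTrusted' ('/data/project/urbanecmbot/11bots/cswiki/userbots/patrolTrusted/patrolTrusted.sh')",
    'Your job 2061435 ("patrolTrusted") has been submitted',
    "2021-10-25T06:25:04+00:00 Restarting job 'patrolAfterPatrol' ('/data/project/urbanecmbot/11bots/cswiki/userbots/patrolTrusted/patrolEditsAfterArticle.sh')",
    "2021-10-24T10:00:04+00:00 Restarting job 'patrolSandbox' ('/data/project/urbanecmbot/11bots/cswiki/userbots/patrolTrusted/patrolSandbox.sh')",
]

events = parse_restarts(sample_mail_lines)
restart_counts = Counter(name for _, name, _ in events)

# Print restarts per job, most frequently restarted first.
for name, count in restart_counts.most_common():
    print(count, name)
```

Run over the full set of mails, the per-job counts and timestamps could be compared against the times of the scheduler warnings to see whether the restarts line up with them.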

Any ideas what might be happening? I'm tagging Toolforge and cloud-services-team to get some attention on this issue; feel free to re-tag with tags more appropriate for user-support requests.

Event Timeline

Restricted Application added a subscriber: Aklapper.

FTR, a few days ago I made the change at https://gerrit.wikimedia.org/r/c/labs/tools/urbanecmbot/+/734228 as an uncommitted one (ls -l suggests it happened on Oct 23), but I'm still receiving a ton of mails :/.

Looking at SGE's messages file, I see multiple entries like

10/25/2021 11:20:20| timer|tools-sgegrid-master|W|failed to deliver job 894238.1 to queue "task@tools-sgeexec-0910.tools.eqiad.wmflabs"
10/25/2021 11:20:31| timer|tools-sgegrid-master|W|failed to deliver job 954122.1 to queue "task@tools-sgeexec-0934.tools.eqiad.wmflabs"
10/25/2021 11:22:36| timer|tools-sgegrid-master|W|failed to deliver job 333140.1 to queue "task@tools-sgeexec-0916.tools.eqiad.wmflabs"

across four different nodes:

root@tools-sgegrid-master:~# grep "failed to deliver job" /data/project/.system_sge/gridengine/spool/qmaster/messages | awk '{print $10}'|sort|uniq -c
    425 "continuous@tools-sgeexec-0933.tools.eqiad.wmflabs"
    425 "task@tools-sgeexec-0910.tools.eqiad.wmflabs"
    425 "task@tools-sgeexec-0916.tools.eqiad.wmflabs"
    425 "task@tools-sgeexec-0934.tools.eqiad.wmflabs"
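The same tally can be done in Python, which also makes it easy to count how many distinct job IDs are behind the warnings on each queue. This is a minimal sketch that assumes the message format shown in the entries above; `FAILED_RE` and `tally_failed_deliveries` are hypothetical names, and the sample input is just the three lines quoted earlier.

```python
import re
from collections import Counter

# Assumed format of the qmaster warning lines quoted above:
#   ...|W|failed to deliver job <job id>.<task id> to queue "<queue>@<host>"
FAILED_RE = re.compile(r'failed to deliver job (\d+)\.(\d+) to queue "([^"]+)"')


def tally_failed_deliveries(lines):
    """Count warnings per queue and collect the distinct job IDs per queue."""
    warnings_per_queue = Counter()
    jobs_per_queue = {}
    for line in lines:
        match = FAILED_RE.search(line)
        if not match:
            continue
        job_id, _task_id, queue = match.groups()
        warnings_per_queue[queue] += 1
        jobs_per_queue.setdefault(queue, set()).add(job_id)
    return warnings_per_queue, jobs_per_queue


# Sample input: the three entries quoted from the messages file.
sample_messages = [
    '10/25/2021 11:20:20| timer|tools-sgegrid-master|W|failed to deliver job 894238.1 to queue "task@tools-sgeexec-0910.tools.eqiad.wmflabs"',
    '10/25/2021 11:20:31| timer|tools-sgegrid-master|W|failed to deliver job 954122.1 to queue "task@tools-sgeexec-0934.tools.eqiad.wmflabs"',
    '10/25/2021 11:22:36| timer|tools-sgegrid-master|W|failed to deliver job 333140.1 to queue "task@tools-sgeexec-0916.tools.eqiad.wmflabs"',
]

counts, jobs = tally_failed_deliveries(sample_messages)

# Columns: number of warnings, number of distinct jobs, queue.
for queue, warning_count in counts.most_common():
    print(warning_count, len(jobs[queue]), queue)
```

If the distinct-job count per queue turns out to be much smaller than the warning count, that would suggest a few jobs being retried over and over rather than many jobs failing once each.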

I'm going to depool one to debug it further / reboot / whatever

Mentioned in SAL (#wikimedia-cloud) [2021-10-25T11:32:31Z] <majavah> depool tools-sgeexec-0910 T294228

I depooled sgeexec-0910. There are still some jobs running on it, even though SGE says it has 1 job in 'dt' state and is otherwise empty.

taavi@tools-sgeexec-0910:~ $ systemctl status gridengine-exec.service
● gridengine-exec.service - LSB: SGE Execution Daemon init script
   Loaded: loaded (/etc/init.d/gridengine-exec; generated; vendor preset: enabled)
   Active: active (running) since Thu 2021-03-25 17:49:48 UTC; 7 months 0 days ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 42 (limit: 4915)
   CGroup: /system.slice/gridengine-exec.service
           ├─  781 /usr/lib/gridengine/sge_execd
           ├─ 3516 /usr/lib/gridengine/sge_shepherd -bg
           ├─ 3518 /usr/lib/gridengine/sge_shepherd -bg
           ├─ 3519 AnomieBOT II (200): TemplateUnsubstifier TFATitleSubpageCreator
           ├─ 3521 /bin/sh /var/spool/gridengine/execd/tools-sgeexec-0910/job_scripts/85368
           ├─ 4108 /bin/bash /mnt/nfs/labstore-secondary-tools-project/archive-things-4/scripts/musescore-2.sh 5
           ├─ 9962 /usr/lib/gridengine/sge_shepherd -bg
           ├─ 9964 /bin/sh /var/spool/gridengine/execd/tools-sgeexec-0910/job_scripts/3244486
           ├─17742 python3 /shared/pywikipedia/core/pwb.py /data/project/yifeibot/pywikibot-shared/addbot-worker.py
           ├─18211 /usr/lib/gridengine/sge_shepherd -bg
           ├─18213 /bin/sh /var/spool/gridengine/execd/tools-sgeexec-0910/job_scripts/651655
           ├─18214 /usr/bin/python3 /data/project/ashbot/pywiki/pwb.py /data/project/ashbot/pywiki/scripts/userscripts/
           ├─18399 /data/project/datbot/py3/bin/python /data/project/datbot/Tasks/pendingbacklog/pending.py
           ├─23992 wget -nv --retry-connrefused --max-redirect=0 -a /data/project/archive-things-4/musescore/scores-202
           ├─24388 /usr/lib/gridengine/sge_shepherd -bg
           ├─24390 /bin/sh /var/spool/gridengine/execd/tools-sgeexec-0910/job_scripts/9570401
           ├─24391 /data/project/wikibugs/py35-stretch/bin/python /data/project/wikibugs/libera/redis2irc.py --logfile
           ├─24475 /usr/lib/gridengine/sge_shepherd -bg
           ├─24477 /bin/sh /var/spool/gridengine/execd/tools-sgeexec-0910/job_scripts/107318
           ├─26640 /usr/lib/gridengine/sge_shepherd -bg
           ├─26642 /bin/sh /var/spool/gridengine/execd/tools-sgeexec-0910/job_scripts/1302449
           ├─26643 /usr/bin/php persist.php until_It_Sleeps_botsconfig.php
           ├─26644 sh -c "php" ./irc.php until_It_Sleeps_botsconfig.php
           └─26645 php ./irc.php until_It_Sleeps_botsconfig.php

I rebooted the node (to get rid of the stray processes) and re-pooled it.

Unfortunately, this continues to happen. Any advice would be appreciated.