Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung))
Closed, Resolved · Public

Description

Workaround

To remove the deadlock it is recommended to disconnect Jenkins from the Gearman server and reconnect it. This is done on the https://integration.wikimedia.org/ci/manage page:

jenkins-gearman-disconnect.png (205×590 px, 17 KB)

Uncheck the box, scroll to the bottom, and save. That removes the deadlock instantly. After a few seconds, check the box again and save.

If it still fails, restart Jenkins entirely :(

Upstream bug is https://issues.jenkins-ci.org/browse/JENKINS-25867


From James' email to the QA list:

Beta Labs isn't synchronising; AFAICS it hasn't done so since ~ 11 hours
ago (15:10 UTC on 2014-09-08). I noticed this when prepping a patch for
tomorrow and found that.

Going to https://integration.wikimedia.org/ci/view/Beta/ I found that
"beta-update-databases-eqiad" had been executing for 12 hours, and
initially assumed that we had a run-away update.php issue again. However,
on examination it looks like "deployment-bastion.eqiad", or the Jenkins
executor on it, isn't responding in some way:

pending—Waiting for next available executor on deployment-bastion.eqiad

I terminated the beta-update-databases-eqiad run to see if that would help,
but it just switched over to beta-scap-eqiad being the pending task.

Having chatted with MaxSem, I briefly disabled in the jenkins interface the
deployment-bastion.eqiad node and then re-enabled it, to no effect.

Any ideas?

November 2014 thread dump:

https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvMy8tLWplbmtpbnMtdGhyZWFkcy1kdW1wLnR4dC0tMTAtMjctMw==

July 2019 one:

Another thread dump: P8736
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDcvMTEvLS10aHJlYWRkdW1wLnR4dC0tOS0zMC0zMg==

Related Objects

Event Timeline

@hashar, I gave up on Storyboard. We are working on it over in the JIRA issue tracker: https://issues.jenkins-ci.org/browse/JENKINS-25867

One additional change you should pick up is https://review.openstack.org/#/c/192429/

Ah, thanks for switching to JIRA; this way we get mail notifications :-D I followed up there.

Rebuilding the plugin with https://review.openstack.org/#/c/192429/2

git fetch https://review.openstack.org/openstack-infra/gearman-plugin refs/changes/29/192429/2 && git checkout FETCH_HEAD
mvn -Dproject-version="`git describe`-change_192429_2" -DskipTests=true  clean package

Thus upgrading the plugin from 0.1.1-8-gf2024bd to 0.1.1-9-g08e9c42-change_192429_2.

I haven't noticed the error since July 1st, nor do the Jenkins logs show any null lock. So it seems the fix in the Gearman plugin resolved it.

It happened again :(

Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null

With beta-scap-eqiad and beta-update-databases-eqiad being stuck waiting for an available executor on deployment-bastion.

Marking the node offline and online doesn't remove the lock :-/

The executor threads have:

"Gearman worker deployment-bastion.eqiad_exec-1" prio=5 WAITING
	java.lang.Object.wait(Native Method)
	java.lang.Object.wait(Object.java:503)
	hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
	hudson.plugins.gearman.StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
	hudson.plugins.gearman.StartJobWorker.executeFunction(StartJobWorker.java:114)
	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
	hudson.plugins.gearman.MyGearmanWorkerImpl.submitFunction(MyGearmanWorkerImpl.java:593)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:328)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-2" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-3" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-4" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-5" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)
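
One plausible reading of the dump above (consistent with the later analysis in this task): exec-1 holds the node's availability lock (NodeAvailabilityMonitor) and blocks in StartJobWorker.safeExecuteFunction() on a future that only completes once its build actually starts; the Jenkins queue meanwhile refuses to start builds on the locked node, so the future never completes, the lock is never released, and exec-2 through exec-5 stay parked in NodeAvailabilityMonitor.lock(). Below is a minimal, self-contained Java sketch of that hypothesized pattern (illustrative only, not plugin code; class, thread, and variable names are made up):

// Minimal sketch of the hypothesized deadlock: a worker locks the node and waits
// for its build to start, queue maintenance rejects builds while the node is locked,
// so the lock is never released and the other workers stay parked behind it.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GearmanDeadlockSketch {
    static volatile String workerHoldingLock = null;                 // NodeAvailabilityMonitor state
    static final CompletableFuture<Void> buildStarted = new CompletableFuture<>();

    public static void main(String[] args) throws Exception {
        Thread exec1 = new Thread(() -> {
            workerHoldingLock = "exec-1";                             // lock() succeeded
            try {
                // StartJobWorker.safeExecuteFunction(): block until the scheduled build starts.
                buildStarted.get(3, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                System.out.println("exec-1: build never started, lock never released -> deadlock");
            } catch (Exception ignored) {
            }
        }, "exec-1");

        Thread queueMaintenance = new Thread(() -> {
            // Queue maintenance asks canTake(): while the node is locked, only the build the
            // locking worker expects may run, but the queued item does not match, so it is
            // rejected and buildStarted is never completed. The other executor threads
            // (exec-2..5 in the dump) would meanwhile sit in NodeAvailabilityMonitor.lock().
            if (workerHoldingLock != null) {
                System.out.println("queue: rejected build, node locked by " + workerHoldingLock);
            }
        }, "queue-maintenance");

        exec1.start();
        Thread.sleep(100);                                            // let exec-1 take the lock first
        queueMaintenance.start();
        exec1.join();
        queueMaintenance.join();
    }
}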

The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.

The workaround I found was to remove the label from the node. Once done, the jobs show in the queue with 'no node having label deployment-bastion-eqiad'. I then applied the label again on the host and the jobs managed to run.
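
For reference, the same label dance can be scripted against the Jenkins model API, for instance from the script console. This is only a sketch under assumptions: that the node is a regular Slave and that Slave#getLabelString/Slave#setLabelString are available on the Jenkins version in use; the node name is the one from the thread dump above.

// Hedged sketch of the label workaround: drop the label so queued jobs report
// "no node having label ...", then re-apply it so they can be scheduled again.
import hudson.model.Node;
import hudson.model.Slave;
import jenkins.model.Jenkins;

public class LabelToggleSketch {
    public static void toggle() throws Exception {
        Jenkins jenkins = Jenkins.getInstance();
        Node node = jenkins.getNode("deployment-bastion.eqiad");      // node name as in the thread dump
        if (node instanceof Slave) {
            Slave slave = (Slave) node;
            String original = slave.getLabelString();                 // the label the jobs are tied to
            slave.setLabelString("");                                 // remove the label
            jenkins.save();
            Thread.sleep(5000);                                       // give the queue time to re-evaluate
            slave.setLabelString(original);                           // apply the label again
            jenkins.save();
        }
    }
}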

So maybe it is an issue in Jenkins itself :-}

Change 227440 had a related patch set uploaded (by Hashar):
beta: expand {datacenter} to 'eqiad'

https://gerrit.wikimedia.org/r/227440

Change 227441 had a related patch set uploaded (by Hashar):
beta: disambiguate Jenkins label from node name

https://gerrit.wikimedia.org/r/227441

Change 227440 merged by jenkins-bot:
beta: expand {datacenter} to 'eqiad'

https://gerrit.wikimedia.org/r/227440

Change 227441 merged by jenkins-bot:
beta: disambiguate Jenkins label from node name

https://gerrit.wikimedia.org/r/227441

I renamed the Jenkins label to disambiguate the node name and the label (now BetaClusterBastion).

That still happens from time to time with Jenkins 1.625.3 and the Gearman Plugin 0.1.3.3.01da2d4 (which is 1.3.3 + https://review.openstack.org/#/c/252768/ ).

To remove the deadlock one can either disconnect and reconnect the Gearman plugin from the Gearman server, or restart Jenkins entirely (see the workaround in the task description).

hashar updated the task description.
hashar updated the task description.
hashar moved this task from Backlog to Reported Upstream on the Upstream board.

Upstream https://review.openstack.org/#/c/252768/ has been abandoned in favor of https://review.openstack.org/#/c/271543/ . It uses a different internal API which should no longer return null. Specifically, it replaces:

- Computer.currentComputer()
+ Jenkins.getActiveInstance().getComputer("")

But that is solely to properly get the master node, on which we run no jobs, so it is unlikely to fix anything for us.

Updated the Gearman plugin to 0.1.3.3.a5164d6

Locked up again. I don't quite understand the instructions in the summary that say you can disable the Gearman plugin without restarting Jenkins. Doesn't enabling/disabling a plugin require a restart?

Sorry @bd808, the phrasing wasn't perfect. There is no need to disable the Gearman plugin; you just have to disconnect it from the Gearman server, which is done on the Jenkins manage page:

jenkins-gearman-disconnect.png (205×590 px, 17 KB)

Doing so causes the Gearman client on Jenkins to disconnect from the Zuul Gearman server, stop overriding the Jenkins executors, and reset its state, which gets rid of the deadlock.

I have updated the task summary with the above screenshot (that shows my leet Paint skills).

Thanks @hashar. I'd forgotten about that setting and was instead thinking that the instruction was to disable/enable the plugin at https://integration.wikimedia.org/ci/pluginManager/.

Danny_B renamed this task from [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung)) to Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung)). May 31 2016, 3:14 PM
Danny_B removed a subscriber: wikibugs-l-list.

Blind search-and-replace was incorrect here. This is not a Language-Team bug; we were only CCed.

hashar claimed this task.
hashar lowered the priority of this task from High to Medium.

This still happens, albeit very rarely nowadays, to the point that it is almost a non-issue. I have only noticed it once over the last few months, and the root cause was unrelated (thread starvation in the Jenkins SSH plugin, which caused it to no longer properly connect slaves).

Assuming it is fixed.

> This still happens, albeit very rarely nowadays, to the point that it is almost a non-issue. I have only noticed it once over the last few months, and the root cause was unrelated (thread starvation in the Jenkins SSH plugin, which caused it to no longer properly connect slaves).

So, this is still happening but I'm unsure of the root cause. It happened again today and happened a few weeks ago: http://tools.wmflabs.org/sal/log/AV9ZS8j8F4fsM4DBdUuo (2017-10-26). Once a month is annoying.

> Assuming it is fixed.

Well, something isn't :P

That still happens indeed (T181313). Maybe we can move the beta cluster jobs to a dedicated/standalone Jenkins instance.

Potentially we would create a dedicated Jenkins to drive the beta cluster, which is T183164. Unassigning since I am focusing on other duties.

Change 419674 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Use a Zuul mutex for the coverage patch jobs

https://gerrit.wikimedia.org/r/419674

Change 419674 merged by jenkins-bot:
[integration/config@master] Use a Zuul mutex for the coverage patch jobs

https://gerrit.wikimedia.org/r/419674

This happens once in a while. It's some sort of deadlock in Jenkins itself. Here's how I generally try to resolve it:

Sometimes you have to do this whole dance twice before Jenkins realizes that there are a bunch of executors it can use.

This deadlock seems to happen more often than not following or during a database update that is taking a while to complete.

^ just had to do this for deployment-tin

Mentioned in SAL (#wikimedia-releng) [2018-07-04T20:15:19Z] <Reedy> beta unbroke beta code autodeploy T72597

Mentioned in SAL (#wikimedia-releng) [2019-07-11T09:06:25Z] <hashar> beta cluster jobs are dead locked. Taking a thread dump in case it helps figure out what is going on. T72597

In the Jenkins log bucket for hudson.model.Queue:

First, the builds in the queue:

FINE hudson.model.Queue

Queue maintenance started on hudson.model.Queue@2ea587fb with Queue.Snapshot{
waitingList=[
  hudson.model.Queue$WaitingItem:hudson.model.FreeStyleProject@1956d4b9[mwgate-node10-docker]:116546,
  hudson.model.Queue$WaitingItem:hudson.model.FreeStyleProject@35e87319[mwext-php72-phan-docker]:116547,
  hudson.model.Queue$WaitingItem:hudson.model.FreeStyleProject@fc512c3[mwext-php70-phan-seccheck-docker]:116548
];
blockedProjects=[
  hudson.model.Queue$BlockedItem:hudson.model.FreeStyleProject@34bd6b8b[beta-mediawiki-config-update-eqiad]:113951,
  hudson.model.Queue$BlockedItem:hudson.model.FreeStyleProject@4988adf9[beta-code-update-eqiad]:113983
];
buildables=[
  hudson.model.Queue$BuildableItem:hudson.model.FreeStyleProject@38d6ee0d[beta-scap-eqiad]:113952,
  hudson.model.Queue$BuildableItem:hudson.model.FreeStyleProject@c175473[beta-update-databases-eqiad]:114190,
  hudson.model.Queue$BuildableItem:hudson.model.FreeStyleProject@7728417c[beta-publish-deb]:116220
];
pendings=[]
}
FINER hudson.model.Queue

Failed to map hudson.model.Queue$BuildableItem:hudson.model.FreeStyleProject@38d6ee0d[beta-scap-eqiad]:113952 to executors.
candidates=[]
parked=[
JobOffer[integration-r-lang-01 #0], JobOffer[integration-slave-docker-1043 #2], JobOffer[integration-slave-docker-1050 #3], JobOffer[saucelabs-01 #0], JobOffer[saucelabs-02 #0], JobOffer[deployment-deploy01 #3], JobOffer[integration-slave-jessie-1001 #0], JobOffer[integration-castor03 #0], JobOffer[compiler1002.puppet-diffs.eqiad.wmflabs #0], JobOffer[integration-slave-docker-1050 #0], JobOffer[integration-slave-jessie-1002 #0], JobOffer[compiler1001.puppet-diffs.eqiad.wmflabs #0], JobOffer[integration-slave-jessie-1004 #0], JobOffer[integration-slave-docker-1043 #0], JobOffer[integration-slave-docker-1040 #1], JobOffer[integration-slave-docker-1040 #2], JobOffer[integration-slave-docker-1043 #1], JobOffer[webperformance #0], JobOffer[deployment-deploy01 #2], JobOffer[integration-slave-docker-1058 #0], JobOffer[integration-slave-docker-1059 #3], JobOffer[integration-trigger-01 #4], JobOffer[integration-slave-docker-1048 #0], JobOffer[integration-trigger-01 #7], JobOffer[integration-trigger-01 #9], JobOffer[deployment-deploy01 #1], JobOffer[integration-slave-docker-1054 #2], JobOffer[integration-slave-docker-1050 #1], JobOffer[integration-slave-docker-1059 #1], JobOffer[integration-slave-docker-1059 #0], JobOffer[integration-trigger-01 #6], JobOffer[integration-trigger-01 #5], JobOffer[integration-slave-docker-1058 #1], JobOffer[integration-trigger-01 #1], JobOffer[integration-slave-docker-1051 #2], JobOffer[integration-slave-docker-1054 #1], JobOffer[integration-trigger-01 #3], JobOffer[contint1001 #2], JobOffer[contint1001 #0], JobOffer[deployment-deploy01 #0], JobOffer[integration-slave-docker-1048 #1], JobOffer[integration-slave-docker-1051 #0], JobOffer[integration-trigger-01 #0], JobOffer[integration-trigger-01 #2]
]

It has JobOffer instances for deployment-deploy01 #0 to #3, which would be the four executors on that agent.

Same message for all three jobs: beta-scap-eqiad, beta-update-databases-eqiad and beta-publish-deb.

core/src/main/java/hudson/model/Queue.java
List<JobOffer> candidates = new ArrayList<>(parked.size());
List<CauseOfBlockage> reasons = new ArrayList<>(parked.size());
for (JobOffer j : parked.values()) {
    CauseOfBlockage reason = j.getCauseOfBlockage(p);
    if (reason == null) {
        LOGGER.log(Level.FINEST,
                "{0} is a potential candidate for task {1}",
                new Object[]{j, taskDisplayName});
        candidates.add(j);
    } else {
        LOGGER.log(Level.FINEST, "{0} rejected {1}: {2}", new Object[] {j, taskDisplayName, reason});
        reasons.add(reason);
    }
}

MappingWorksheet ws = new MappingWorksheet(p, candidates);
Mapping m = loadBalancer.map(p.task, ws);
if (m == null) {
    // if we couldn't find the executor that fits,
    // just leave it in the buildables list and
    // check if we can execute other projects
    LOGGER.log(Level.FINER, "Failed to map {0} to executors. candidates={1} parked={2}",
            new Object[]{p, candidates, parked.values()});
    p.transientCausesOfBlockage = reasons.isEmpty() ? null : reasons;
    continue;
}

Mentioned in SAL (#wikimedia-releng) [2019-07-11T09:48:47Z] <hashar> jenkins: add more log details to hudson.model.Queue (FINER > FINEST) https://integration.wikimedia.org/ci/log/Jenkins%20Queue/configure # T72597

The Jenkins logger ( https://integration.wikimedia.org/ci/log/Jenkins%20Queue/ ) was missing the FINEST log level. The reasons are the same as shown in the web GUI, for example:

JobOffer[deployment-deploy01 #0] rejected beta-update-databases-eqiad: Waiting for next available executor on ‘deployment-deploy01’

The Gearman plugin does implement a canTake method:

src/main/java/hudson/plugins/gearman/NodeAvailabilityMonitor.java
public boolean canTake(Queue.BuildableItem item)
{
    // Jenkins calls this from within the scheduler maintenance
    // function (while owning the queue monitor).  If we are
    // locked, only allow the build we are expecting to run.
    logger.debug("AvailabilityMonitor canTake request for " +
                 workerHoldingLock);

    NodeParametersAction param = item.getAction(NodeParametersAction.class);
    if (param != null) {
        logger.debug("AvailabilityMonitor canTake request for UUID " +
                     param.getUuid() + " expecting " + expectedUUID);

        if (expectedUUID == param.getUuid()) {
            return true;
        }
    }
    return (workerHoldingLock == null);
}

And in the log:

Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0
Jul 11, 2019 10:01:51 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-deploy01_exec-0

So it seems blocked because there are somehow no parameters, even though there should at least be a uuid/id :-\
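
Reading the quoted canTake() against that log: the second debug line ("canTake request for UUID ... expecting ...") never appears, so the queued item apparently carries no NodeParametersAction; the method then falls through to the workerHoldingLock check, which fails for as long as deployment-deploy01_exec-0 keeps the lock. A hedged paraphrase of that decision (not plugin code):

// Paraphrase of the quoted canTake() for the items in the log above.
// itemUuid is null when the queued item has no NodeParametersAction.
static boolean canTakeSketch(String workerHoldingLock, String expectedUUID, String itemUuid) {
    if (itemUuid != null && itemUuid.equals(expectedUUID)) {
        return true;                      // the one build this worker is waiting for
    }
    // With no UUID on the item, everything hinges on the lock...
    return workerHoldingLock == null;     // ...false here: deployment-deploy01_exec-0 holds it
}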

A related issue is the mwext-codehealth jobs, which are configured not to run concurrently. Sometimes one would see several of them pending in the Jenkins build queue while some of the Jenkins agents were idling, although they should be running jobs.

The builds are queued by Jenkins, but the Gearman plugin has already assigned a node for those builds. The node assignment can be seen in /var/lib/jenkins/queue.xml:

<hudson.model.Queue_-State>
  <items>
    <hudson.model.Queue_-BlockedItem>
      <actions>
        <hudson.plugins.gearman.NodeAssignmentAction plugin="gearman-plugin@0.2.0.3.e27817f">
          <labelAtom>integration-agent-docker-1009</labelAtom>
        </hudson.plugins.gearman.NodeAssignmentAction>
...
    <hudson.model.Queue_-BlockedItem>
      <actions>
        <hudson.plugins.gearman.NodeAssignmentAction plugin="gearman-plugin@0.2.0.3.e27817f">
          <labelAtom>integration-agent-docker-1005</labelAtom>
        </hudson.plugins.gearman.NodeAssignmentAction>

I don't have the details, but Gearman is thus unable to use any of the executors on those two nodes until the build queued by Jenkins starts executing.

It might be related to the lock we occasionally have for deployment-prep.
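
As a quick diagnostic, the pre-assigned nodes can be listed straight from /var/lib/jenkins/queue.xml. A small sketch (standard JDK only; the path and element name are the ones shown above, and it naively matches every <labelAtom>, which is good enough for a spot check):

// List the nodes that queued builds have been pre-assigned to, per queue.xml.
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class QueueAssignments {
    public static void main(String[] args) throws Exception {
        File queueXml = new File("/var/lib/jenkins/queue.xml");
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(queueXml);
        NodeList labels = doc.getElementsByTagName("labelAtom");
        for (int i = 0; i < labels.getLength(); i++) {
            System.out.println("pre-assigned to: " + labels.item(i).getTextContent());
        }
    }
}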

I took a heap dump on contint2001 which is at /var/lib/jenkins/202201281527.hprof

Mentioned in SAL (#wikimedia-releng) [2022-06-30T22:02:36Z] <TheresNoTime> unstuck beta-mediawiki-config-update-eqiad jobs, will comment at T72597

The last few "sets" of beta-mediawiki-config-update-eqiad jobs have got stuck and needed manual actions (i.e. cancelling all other pending beta deployment jobs repeatedly until the backlog of beta-mediawiki-config-update-eqiad jobs have completed)

To note, beta-scap-sync-world gets stuck waiting on beta-mediawiki-config-update-eqiad with the error

#57646
(pending—Waiting for next available executor on ‘deployment-deploy03’; ‘contint1001’ doesn’t have label ‘BetaClusterBastion’; ‘contint2001’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1023’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1024’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1025’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1026’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1027’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1028’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1029’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1030’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1031’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1032’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1033’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1034’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1035’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1036’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1037’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1038’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-docker-1039’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-pkgbuilder-1001’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-pkgbuilder-1002’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-puppet-docker-1003’ doesn’t have label ‘BetaClusterBastion’; ‘integration-agent-qemu-1003’ doesn’t have label ‘BetaClusterBastion’; ‘integration-castor05’ doesn’t have label ‘BetaClusterBastion’; ‘pcc-worker1001.puppet-diffs.eqiad1.wikimedia.cloud’ doesn’t have label ‘BetaClusterBastion’; ‘pcc-worker1002.puppet-diffs.eqiad1.wikimedia.cloud’ doesn’t have label ‘BetaClusterBastion’; ‘pcc-worker1003.puppet-diffs.eqiad1.wikimedia.cloud’ doesn’t have label ‘BetaClusterBastion’)

nb. just sat and watched a set of deploys (what else do you do at 11pm?) — this seems to occur when a beta-mediawiki-config-update-eqiad is running and a beta-code-update-eqiad job is triggered via timer. There's either no lockfile to prevent the two from running at the same time, or it ignores it?

Mentioned in SAL (#wikimedia-releng) [2022-07-07T21:10:41Z] <TheresNoTime> clear stuck beta deployment jobs, T72597

Mentioned in SAL (#wikimedia-releng) [2022-07-07T22:42:26Z] <TheresNoTime> clear stuck beta deployment jobs (again), T72597

Mentioned in SAL (#wikimedia-releng) [2022-08-02T07:55:36Z] <TheresNoTime> cleared stuck beta deployment jobs T72597

Mentioned in SAL (#wikimedia-releng) [2022-08-04T10:01:13Z] <TheresNoTime> clearing out stuck beta deployment jobs T314378 T72597

I am tentatively marking this as resolved since I haven't seen it happen in quite a while.