
Establish a systemd timer to remove long-running processes on the bastion in a random and somewhat friendly way
Closed, Resolved (Public)

Description

To have a task to link against: I started writing a script that makes weighted attacks on processes that fall outside the recommended uses for the bastions. So far, it gathers the older running processes (the minimum age is configurable), uses each process's age as a weight, and then selects a number of them at random to kill. The aim is to let a reasonable long-running action sometimes finish, while still making the bastions a not-so-good place for long-running processes in general. That way people are more likely to use the job grid or k8s, while we can improve services overall on the bastions for legitimate uses.
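The selection described above could be sketched roughly as follows. This is only an illustration, not the merged script: the `Proc` record, the `pick_victims` name, and the default thresholds are all hypothetical, and it assumes Python 3.6+ so that `random.choices()` is available.

```python
import random
from collections import namedtuple

# Hypothetical record type; the real script's data structures are not shown in this task.
Proc = namedtuple("Proc", ["pid", "name", "age_hours"])


def pick_victims(procs, min_age_hours=24.0, count=2, rng=random):
    """Pick up to `count` distinct old processes, favouring the older ones."""
    # Only processes older than the configurable minimum age are candidates.
    old = [p for p in procs if p.age_hours >= min_age_hours]
    if not old:
        return []
    # The age itself is the weight: twice as old means twice as likely to be drawn.
    weights = [p.age_hours for p in old]
    # random.choices() samples with replacement, so duplicates are collapsed by pid.
    drawn = rng.choices(old, weights=weights, k=count)
    unique = {p.pid: p for p in drawn}
    return list(unique.values())
```

Because the draw is random, a process that is only slightly over the age limit has a fair chance of surviving any single run, which matches the "sometimes finish" intent.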

This may also relieve some of the trouble caused by using such tight limits on highly shared servers (T218338).

Event Timeline

Bstorm created this task.

Change 635888 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: script to make long-running processes on bastions less good

https://gerrit.wikimedia.org/r/635888

Change 635888 merged by Bstorm:
[operations/puppet@production] toolforge: script to make long-running processes on bastions less good

https://gerrit.wikimedia.org/r/635888

Change 637535 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge bastion: fix the wmcs_wheel_of_misfortune script for py3.5

https://gerrit.wikimedia.org/r/637535

Change 637535 merged by Bstorm:
[operations/puppet@production] toolforge bastion: fix the wmcs_wheel_of_misfortune script for py3.5

https://gerrit.wikimedia.org/r/637535

Change 639224 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsforge bastion: fix an error in the killer script

https://gerrit.wikimedia.org/r/639224

Change 639224 merged by Bstorm:
[operations/puppet@production] toolsforge bastion: fix an error in the killer script

https://gerrit.wikimedia.org/r/639224

I've been running a screen session on the main bastion, so I'm hoping to see my process get killed.

Feedback from @Multichill: screen is part of his workflow, and he would like to be able to keep a screen session open without it getting squashed by this service. I think that might be a good point.

Screen and tmux themselves aren't the problem. It's whatever might be running for days inside a screen session, if something is.

Of course, one quirk there is shells.

Change 639617 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge bastion: safelist shells and related procs for the killer

https://gerrit.wikimedia.org/r/639617

Change 639620 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge bastion: tweak email wording for process killer

https://gerrit.wikimedia.org/r/639620

Change 639641 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge bastion: reduce number of wheel-of-misfortune runs

https://gerrit.wikimedia.org/r/639641

Change 639620 merged by Bstorm:
[operations/puppet@production] toolforge bastion: tweak email wording for process killer

https://gerrit.wikimedia.org/r/639620

Change 639641 merged by Bstorm:
[operations/puppet@production] toolforge bastion: reduce number of wheel-of-misfortune runs

https://gerrit.wikimedia.org/r/639641

Change 639617 merged by Bstorm:
[operations/puppet@production] toolforge bastion: safelist shells and related procs for the killer

https://gerrit.wikimedia.org/r/639617

Change 639796 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge bastion: improve the killer a bit

https://gerrit.wikimedia.org/r/639796

Change 639796 merged by Bstorm:
[operations/puppet@production] toolforge bastion: improve the killer a bit

https://gerrit.wikimedia.org/r/639796

At this point, a dry run only comes up with one sftp session (at least I think that's what I'm looking at). I don't think sftp sessions need to be kept open for days on end, so that seems OK. Everything else that would have been killed probably already has been. Screen sessions will be exempt from this service's behavior.

The next round of changes on this likely should focus on https://psutil.readthedocs.io/en/latest/#psutil.cpu_times
It is possible on Linux to check out what the process has been doing and use that instead of wall clock time to kick processes off. When this goes through the proclist in the spin_the_wheel function, it could start by logging things that might be killed on the basis of, say, excessive iowait or user time. That would allow safe experimentation with that kind of algorithm. @zhuyifei1999 if you have any interest in tweaking that sort of setting here, I'll happily review it 😁 I think this might be pretty stable for now, though.
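A log-only experiment along those lines might look something like this. It is a sketch under assumptions: `flag_heavy`, `collect_times`, and both thresholds are made up for illustration, and the `iowait` field is only reported by `psutil.Process.cpu_times()` on Linux with a reasonably recent psutil, so the code falls back to zero when it is absent.

```python
import logging
from collections import namedtuple

log = logging.getLogger("wheel_sketch")

# Stand-in for the tuple that psutil.Process.cpu_times() returns (fields trimmed for the example).
CpuTimes = namedtuple("CpuTimes", ["user", "system", "iowait"])


def flag_heavy(proc_times, max_user=3600.0, max_iowait=600.0):
    """Return (pid, reason) pairs for processes over a threshold. Logs only, never kills."""
    flagged = []
    for pid, times in proc_times.items():
        # getattr keeps this working where iowait is not reported.
        iowait = getattr(times, "iowait", 0.0)
        if times.user > max_user:
            flagged.append((pid, "user"))
        elif iowait > max_iowait:
            flagged.append((pid, "iowait"))
    for pid, reason in flagged:
        log.info("would consider pid %d: excessive %s time", pid, reason)
    return flagged


def collect_times():
    """Gather cpu_times for every visible process (needs psutil installed)."""
    import psutil  # imported lazily so flag_heavy() can be tried without psutil

    result = {}
    for proc in psutil.process_iter():
        try:
            result[proc.pid] = proc.cpu_times()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # The process exited, or is off limits, between listing and inspection.
            continue
    return result
```

Calling `flag_heavy(collect_times())` from a dry run would show what a CPU-time policy would have picked, without changing what actually gets killed.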

Thanks for adding screen. Can you add /usr/bin/mysql to the whitelist too? Just the client part. The MariaDB server will automatically disconnect any long-open sessions, and the client will just reconnect when a session is needed again.

Coming back around to this, mysql is one that could definitely be used inappropriately because you could effectively run a bot as mysql. However, it wouldn't daemonize at least. We could maybe add it since we already monitor for crons and there is a query/session killer.

That said, I've noticed the script needs at least one more improvement. It crashes when there's nothing to consider for killing, because of the way we implemented the random.choices() function (stolen from Python 3.6).
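For reference, a backport of weighted choice that tolerates an empty candidate list could look roughly like this. It is a sketch modelled on how `random.choices()` works in the standard library, not the code from the patch; the `weighted_choices` name and the decision to return an empty list are assumptions.

```python
import bisect
import itertools
import random


def weighted_choices(population, weights, k=1, rng=random):
    """Return k items drawn with replacement, in proportion to their weights."""
    if len(population) != len(weights):
        raise ValueError("population and weights must have the same length")
    if not population or k <= 0:
        # Nothing to consider: return an empty selection instead of
        # failing on cum_weights[-1] below.
        return []
    cum_weights = list(itertools.accumulate(weights))
    total = cum_weights[-1]
    if total <= 0:
        raise ValueError("total of weights must be greater than zero")
    hi = len(cum_weights) - 1
    # Each draw picks a point in [0, total) and finds which item's slice it lands in.
    return [
        population[bisect.bisect(cum_weights, rng.random() * total, 0, hi)]
        for _ in range(k)
    ]
```

Without the early return, an empty list makes `cum_weights[-1]` raise `IndexError`, which is the kind of crash described above.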

Change 643336 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wmcs_wheel_of_misfortune: dont fail for empty lists

https://gerrit.wikimedia.org/r/643336

Change 643337 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wmcs_wheel_of_misfortune: add mysql to exempt shells

https://gerrit.wikimedia.org/r/643337

Change 643336 merged by Bstorm:
[operations/puppet@production] wmcs_wheel_of_misfortune: dont fail for empty lists

https://gerrit.wikimedia.org/r/643336

Change 643337 merged by Bstorm:
[operations/puppet@production] wmcs_wheel_of_misfortune: add mysql to exempt shells

https://gerrit.wikimedia.org/r/643337

Change 644878 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wmcs_wheel_of_misfortune: avoid race condition with proc info

https://gerrit.wikimedia.org/r/644878

Change 644878 merged by Bstorm:
[operations/puppet@production] wmcs_wheel_of_misfortune: avoid race condition with proc info

https://gerrit.wikimedia.org/r/644878