This will be transferred to a new service in the future (per T185319), but depending on the time scale it might need another update to Stretch/Buster beforehand. kraz uses a custom, patched version of ircd-ratbox.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | MoritzMuehlenhoff | T224549 Track remaining jessie systems in production
Resolved | | MoritzMuehlenhoff | T224579 Migrate irc.wikimedia.org/kraz to Buster
Resolved | | taavi | T277081 Replace deployment-ircd with a Buster host
Event Timeline
Hi, is this going to happen? The Beta cluster also has an IRC server running Jessie, and as part of the effort to get rid of Jessie on Beta I'm offering it as a test opportunity for the production upgrade.
That sounds like a good idea! There's a handful of packages which need to be made available for Buster; I'll update the task when that's done.
Mentioned in SAL (#wikimedia-operations) [2021-03-09T15:56:57Z] <moritzm> imported prometheus-ircd-exporter 0.2 to apt.wikimedia.org T224579
I've imported all the custom Buster packages needed by the mw_rc_irc role. If you need anything merged to puppet.git (Hiera or so), please ping me on IRC (moritzm).
Change 670829 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Assign mw_rc_irc role to irc2001
Change 670913 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[operations/mediawiki-config@master] Support having multiple IRC feed servers
Change 670914 had a related patch set uploaded (by Krinkle; owner: Legoktm):
[operations/mediawiki-config@master] Define IRC feed servers as an array in {Production,Labs}Services.php
Change 670915 had a related patch set uploaded (by Krinkle; owner: Legoktm):
[operations/mediawiki-config@master] Remove back-compat from when IRC feed servers was a string
Change 670829 merged by Muehlenhoff:
[operations/puppet@production] Assign mw_rc_irc role to irc2001
Change 670913 merged by jenkins-bot:
[operations/mediawiki-config@master] Support having multiple IRC feed servers
Change 670914 merged by jenkins-bot:
[operations/mediawiki-config@master] Define IRC feed servers as an array in {Production,Labs}Services.php
Mentioned in SAL (#wikimedia-operations) [2021-03-15T23:23:04Z] <legoktm@deploy1002> Synchronized wmf-config/CommonSettings.php: Support having multiple IRC feed servers (T224579) (duration: 00m 58s)
Mentioned in SAL (#wikimedia-operations) [2021-03-15T23:24:41Z] <legoktm@deploy1002> Synchronized wmf-config/: Define IRC feed servers as an array in {Production,Labs}Services.php (T224579) (duration: 00m 59s)
Change 670915 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove back-compat from when IRC feed servers was a string
Mentioned in SAL (#wikimedia-operations) [2021-03-15T23:31:27Z] <legoktm@deploy1002> Synchronized wmf-config/CommonSettings.php: Remove back-compat from when IRC feed servers was a string (T224579) (duration: 00m 59s)
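For readers unfamiliar with this kind of three-step config migration (accept both shapes, switch the definition to an array, drop the back-compat), here is a minimal sketch in Python. The real configuration is PHP in operations/mediawiki-config; the function names, the `udp` scheme and the port value below are hypothetical and only illustrate the pattern.

```python
def normalize_feed_servers(value):
    """Accept the legacy single-string form or the new list form.

    Hypothetical helper: this mirrors the temporary back-compat step,
    which can be deleted once every caller passes a list.
    """
    if isinstance(value, str):
        return [value]
    return list(value)


def feed_uris(servers, port, scheme="udp"):
    """Build one feed URI per server so every event is sent to all of them.

    The scheme and port are placeholders, not the production values.
    """
    return [f"{scheme}://{host}:{port}" for host in servers]


if __name__ == "__main__":
    # Legacy form: a single host name as a string.
    print(feed_uris(normalize_feed_servers("kraz.wikimedia.org"), 12345))
    # New form: a list, so a second server can be added without code changes.
    print(feed_uris(
        normalize_feed_servers(["kraz.wikimedia.org", "irc2001.wikimedia.org"]),
        12345,
    ))
```

With the list form in place, adding or swapping a server becomes a data-only change.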
Change 672687 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/mediawiki-config@master] Add irc2001.wikimedia.org (running buster) as second irc server
Since yesterday, the Prometheus jobs reduced availability alert has been firing about ircd on irc2001. Looking at the logs, there appears to be some breakdown in communication between prometheus-ircd-exporter and ircd:
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: ERROR:__main__:Failed to connect to IRC server
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: ERROR:__main__:Failed to close connection to IRC server
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: Traceback (most recent call last):
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/SocketServer.py", line 599, in process_request_thread
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: self.finish_request(request, client_address)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: self.RequestHandlerClass(request, client_address, self)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: self.handle()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: self.handle_one_request()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: method()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: output = encoder(registry)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: for metric in registry.collect():
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: for metric in collector.collect():
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/bin/prometheus-ircd-exporter", line 55, in collect
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/socket.py", line 191, in __init__
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: _sock = _realsocket(family, type, proto)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: error: [Errno 24] Too many open files
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: Traceback (most recent call last):
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/SocketServer.py", line 599, in process_request_thread
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: self.finish_request(request, client_address)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: self.RequestHandlerClass(request, client_address, self)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: self.handle()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: self.handle_one_request()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: method()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: output = encoder(registry)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: for metric in registry.collect():
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: for metric in collector.collect():
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/bin/prometheus-ircd-exporter", line 55, in collect
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: File "/usr/lib/python2.7/socket.py", line 191, in __init__
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: _sock = _realsocket(family, type, proto)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: error: [Errno 24] Too many open files
It looks like the exporter was stuck in a loop:
[pid 18567] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18486] <... futex resumed> ) = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18486] recvfrom(1020, <unfinished ...>
[pid 18567] <... futex resumed> ) = 0
[pid 18486] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18567] recvfrom(1022, <unfinished ...>
[pid 18486] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18567] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18486] <... futex resumed> ) = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18486] recvfrom(1020, <unfinished ...>
[pid 18567] <... futex resumed> ) = 0
[pid 18486] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18567] recvfrom(1022, <unfinished ...>
[pid 18486] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18567] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18486] <... futex resumed> ) = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18486] recvfrom(1020, <unfinished ...>
[pid 18567] <... futex resumed> ) = 0
[pid 18486] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18366] <... futex resumed> ) = 0
[pid 18567] <... futex resumed> ) = 0
[pid 18486] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
Meanwhile, connections from Prometheus kept piling up. AFAIK the service/exporter has no owner at the moment. I've restarted the exporter, but this is bound to happen again.
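One reading of the strace output above is that threads keep calling `recvfrom` on sockets 1020 and 1022 even though each call returns an empty string, which on a TCP socket means the peer has closed the connection. The following is a minimal sketch of a read loop that treats an empty read as end of stream. It is not the exporter's actual code; `read_until_closed` and its parameters are hypothetical names.

```python
import socket


def read_until_closed(sock, bufsize=8192, max_bytes=1 << 20):
    """Read from a connected socket until the peer closes it.

    recv() returning b"" signals end of stream; looping on it without
    a break spins forever and never frees the descriptor.
    """
    chunks = []
    total = 0
    while total < max_bytes:
        data = sock.recv(bufsize)
        if not data:
            # Peer closed the connection: stop instead of retrying.
            break
        chunks.append(data)
        total += len(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # Local demonstration with a connected socket pair, no network needed.
    writer, reader = socket.socketpair()
    with reader:
        writer.sendall(b"PING\r\n")
        writer.close()
        print(read_until_closed(reader))
```

Wrapping the socket in a `with` block (or calling `close()` in a `finally`) and setting a timeout with `settimeout()` would additionally make sure the descriptor is released when the read fails or stalls.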
Sure enough, the exporter is out of FDs again. I'm +1 to just remove the exporter since the service doesn't have an owner, the exporter is python2 and afaict we don't use the metrics anyway. Thoughts?
+1, let's kill it with fire. It was never really used (and only added to replace an old Diamond collector), and in the near future it will be replaced with something new anyway (which will have its own custom metrics). Fixing this seems like flogging a dead horse.
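For anyone who wants to confirm this kind of descriptor exhaustion from inside a Python process, the sketch below compares the number of open file descriptors with the soft `RLIMIT_NOFILE` limit. It is Linux-only because it reads `/proc/self/fd`, and `fd_usage` is a hypothetical helper, not part of the exporter. The descriptor numbers 1020 and 1022 in the strace output would be consistent with a soft limit of 1024, but the actual limit is not stated in this task.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process.

    Linux-only: counts the entries in /proc/self/fd. The count includes
    the descriptor briefly opened to list that directory.
    """
    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"{used} of {limit} file descriptors in use")
```

Logging this value on every scrape would show a steady climb toward the limit well before `[Errno 24] Too many open files` appears.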
Change 673972 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: remove ircd-exporter
Change 673972 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: remove ircd-exporter
Change 672687 merged by jenkins-bot:
[operations/mediawiki-config@master] Add irc2001.wikimedia.org (running buster) as second irc server
Mentioned in SAL (#wikimedia-operations) [2021-03-23T18:10:38Z] <legoktm@deploy1002> Synchronized wmf-config/ProductionServices.php: Add irc2001.wikimedia.org (running buster) as second irc server (T224579) (duration: 01m 08s)
Events are now going to irc2001.wikimedia.org. I watched #en.wikipedia on both kraz and irc2001 for a few minutes and saw identical output (note that channels won't exist on the new server until an edit/log entry comes through).
From T123729: Migrate irc.wikimedia.org to Jessie:
- Announce in Tech/News, wikitech-l, wikitech-ambassadors that we'll be switching irc.wikimedia.org over to a new server on XX. Include a reminder that clients should switch to eventstreams if possible.
- On XX, switch irc.wikimedia.org DNS to point to irc2001. Clients can switch to the new server manually if they want. All new connections will go to irc2001.
- On XX + 1 week (or shorter?), shut down kraz.wikimedia.org. All clients should automatically reconnect to irc2001.
When should XX be?
Moritz is going to switch DNS and reboot kraz "Thursday during the European morning", announcement to be sent shortly.
For Tech News / User-notice:
- The Wikimedia IRC RC feeds have been switched to a new server. Make sure all tools automatically reconnect to irc.wikimedia.org and not the name of any specific server. Users should also consider switching to EventStreams, a more modern alternative.
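For tool maintainers, here is a minimal sketch of a reconnect loop that always dials the service name rather than a specific host. `socket.create_connection` resolves the name on every call, so a DNS change is picked up on the next reconnect. The port number, timeout, delays and all function names are assumptions for illustration, not taken from this task.

```python
import socket
import time

SERVICE_NAME = "irc.wikimedia.org"  # the service name, never a specific host
IRC_PORT = 6667                     # assumption: standard plaintext IRC port


def connect_by_name():
    """Resolve the service name again and open a fresh connection."""
    return socket.create_connection((SERVICE_NAME, IRC_PORT), timeout=300)


def run_with_reconnect(connect, handle, max_attempts=None,
                       base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call handle(conn) repeatedly, reconnecting whenever a session ends.

    Failed connects back off exponentially up to max_delay. Returns the
    number of connection attempts made (only when max_attempts is set).
    """
    attempts = 0
    delay = base_delay
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            conn = connect()
        except OSError:
            sleep(delay)
            delay = min(delay * 2, max_delay)
            continue
        delay = base_delay  # a successful connect resets the backoff
        try:
            handle(conn)    # returns or raises when the session ends
        except OSError:
            pass
        finally:
            conn.close()
        sleep(base_delay)   # short pause before dialling again
    return attempts
```

A bot would call `run_with_reconnect(connect_by_name, handle_session)`, where `handle_session` is the tool's own (hypothetical) function that registers, joins its channels and reads messages until the server goes away.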
Change 674617 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/dns@master] Point irc.wikimedia.org to irc2001
Mentioned in SAL (#wikimedia-operations) [2021-03-24T15:42:12Z] <moritzm> reduce RAM for irc2001 to 2G, was originally created with 8 G T224579
Change 674617 merged by Muehlenhoff:
[operations/dns@master] Point irc.wikimedia.org to irc2001
I've rebooted kraz to force the remaining bots still connected to kraz to reconnect to irc2001.w.o.
Those connections are quite long-lived, I sampled some stats for bots connected to #de.wikipedia: A day after the CNAME failover half of the bots had moved to irc2001, but two weeks later 1/3 of the bots were still connected to the old IP (until the reboot of kraz happened).
Change 677806 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/mediawiki-config@master] Broadcase IRC events to irc1001 instead of kraz
Change 677806 merged by jenkins-bot:
[operations/mediawiki-config@master] Broadcast IRC events to irc1001 instead of kraz
Mentioned in SAL (#wikimedia-operations) [2021-04-13T23:27:26Z] <legoktm@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:677806|Broadcast IRC events to irc1001 instead of kraz (T224579)]] (duration: 01m 06s)
cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: kraz.wikimedia.org
- kraz.wikimedia.org (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
kraz has been replaced by two Buster instances (irc1001.wikimedia.org and irc2001.wikimedia.org) and has now been removed.