Page MenuHomePhabricator

Post-send work sometimes fatals at "Error: The UdpSocket to 127.0.0.1:10514 has been closed" (esp. mwdebug hosts)
Open, MediumPublicPRODUCTION ERROR

Description

This error show up in Logstash whenever I'm using PHP 7 via X-Wikimedia-Debug

PHP Fatal error:  Uncaught LogicException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore in /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php:45
Stack trace:
#0 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php(31): Monolog\Handler\SyslogUdp\UdpSocket->send('<15>Jan 25 21:3...')
#1 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdpHandler.php(47): Monolog\Handler\SyslogUdp\UdpSocket->write('@cee: {"@timest...', '<15>Jan 25 21:3...')
#2 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/AbstractProcessingHandler.php(37): Monolog\Handler\SyslogUdpHandler->write(Array)
#3 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Logger.php(337): Monolog\Handler\AbstractProcessingHandler->handle(Array)
#4 /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Logger.php(532): Monolog\Logger->addRecord(100, 'Wikimed in /srv/mediawiki/php-1.33.0-wmf.14/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php on line 45
typesyslog
hostmwdebug1002
programphp7.2-fpm

Note that these are logged directly from php, and not via MediaWiki's fatal error handler. As such, they don't have any context about which wiki, or which url it happened with.

Also, I haven't seen my web requests aborted with this error. This suggests that maybe this happening post-send.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Is it sporadic or reproducible at will? My first thought was rsyslog not running (or restarting) and thus 10514 udp not open

Krinkle lowered the priority of this task from High to Low.May 7 2019, 4:21 PM
Krinkle added a subscriber: Joe.

Didn't realise it was limited to mwdebug1002. Never checked other servers but @Joe is right that it only happens there. Let's ignore for now.

Krinkle renamed this task from PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) to PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002).May 7 2019, 4:21 PM
Krinkle moved this task from Untriaged to Mar 2021 on the Wikimedia-production-error board.
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM

Still seen regularly. It's somewhat concerning me because we rely on mwdebug1002 to find non-fatal warnings and errors when testing patches and expect that if Logstash/mwdebug1002 is clear, that means the patch is good to go.

If we instead see "The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to" that is first of all confusing to anyone who doesn't know it's specific to mwdebug1002. But more importantly - does this not mean that some messages were dropped and not reported to Logstash?

That's more concerning because it means we're missing messages that could very well be errors.

@Joe If this is specific to mwdebug1002, perhaps re-imaging the server could help to rule out anything that might've gotten out of whack with this one?


EDIT: might be related to T151428: Error when provisioning mediawiki-vagrant: "The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore"

Tagging RelEng for awareness as this means when looking at Logstash/mwdebug1002 during deployments, there may be critical warnings not visible due to them not making it from MW to syslog/Kafka.

Krinkle raised the priority of this task from Low to Medium.Oct 12 2019, 11:19 PM

Server will be reimaged, I will ping here when it is done :)

Mentioned in SAL (#wikimedia-operations) [2019-10-23T13:34:01Z] <effie> disable puppet on mwdebug1002 - T214734

Note that debug servers aren't pooled in etcd like regular production ones, so mwdebug1002 is still serving debug traffic:

$ curl -v -H "X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet" https://en.wikipedia.org/wiki/Main_Page 2>&1 | grep server:
< server: mwdebug1002.eqiad.wmnet

We could temporarily comment it out from debug_proxy's configuration (modules/role/manifests/debug_proxy.pp), which would result in requests with mwdebug1002 in X-W-D getting a 400 error.

@Krinkle Please let me know if the issue on mwdebug1002 persists, as the server has been reimaged :)

Change 551515 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] trafficserver: re-enable mwdebug1002

https://gerrit.wikimedia.org/r/551515

Change 551515 merged by Ema:
[operations/puppet@production] trafficserver: re-enable mwdebug1002

https://gerrit.wikimedia.org/r/551515

Looks like it is still happening:

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.11.20/syslog?id=AW6KkyhearkxubcmwWlL&_g=h@44136fa

host: mwdebug1002
type: syslog
program:	php7.2-fpm
message:

PHP Fatal error:  Uncaught LogicException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore in /srv/mediawiki/php-1.35.0-wmf.5/vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php:45

Steps to reproduce the issue:

  1. Enable WikimediaDebug to mwdebug1002 with Logging enabled.
  2. Disable WikimediaDebug (because T237687#5679746).
  3. Go to https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002

Change 556302 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Add MOTD to mwdebug1002 warning about T214734

https://gerrit.wikimedia.org/r/556302

Change 556489 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] mwdebug1002: add motd warning

https://gerrit.wikimedia.org/r/556489

Change 556489 merged by CDanis:
[operations/puppet@production] mwdebug1002: add motd warning

https://gerrit.wikimedia.org/r/556489

@jijiki Can we depool mwdebug1002 again so that it can't be used via the ATS & debug_proxy endpoint?

We reversed this to verify that things recovered, but they haven't. That might be more efficient than MOTD, in particular because using mwdebug does not always involve a shell (e.g. Logstash/XHGui options).

Change 556302 abandoned by Gergő Tisza:
Add MOTD to mwdebug1002 warning about T214734

Reason:
If9ca8c (correctly) does this already

https://gerrit.wikimedia.org/r/556302

First of all, we found that this fatal error is present on mwdebug1001, not only mwdebug1002 like we have been assuming in this task.

logstash1001.png (280×1 px, 119 KB)

I have been debugging this today, and after reading UdpSocket.php, my theory is that PHP closes the socket where it has been streaming logs to 127.0.0.1:10514, but then tries to use it again, only to find it has been closed, and thus throwing a fatal error.

new socket:

1576165211.867070 socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 6<socket:[36083338]>

streaming logs:

1576165211.878189 sendto(6<socket:[36083338]>, "<15>Dec 12 15:40:11 mwdebug1002 mediawiki: @cee: {\"@timestamp\":\"2019-12- 
12T15:40:11.877806+00:00\",\"@version\":1,\"host\":\"mwdebug1002\",\"message\":\"MainObjectStash using store 
RedisBagOStuff\",\"type\":\"mediawiki\",\"channel\":\"redis\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"MainObjectStash using store {class}\",\"phpversion\":\"7.2.24- 
1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/wiki/Main_Page\",\"ip\":\"10.64.0.46\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":null,\"unique_id\":\"XfJfWwpAAC4AAF4 - 
WLUAAAAG\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"class\":\"RedisBagOStuff\"}\n", 624, 0, {sa_family=AF_INET, sin_port=htons(10514), sin_addr=inet_addr("127.0.0.1")}, 16) = 624

<snip>

1576165211.884218 sendto(6<socket:[36083338]>, "<15>Dec 12 15:40:11 mwdebug1002 mediawiki: @cee: {\"@timestamp\":\"2019-12-12T15:40:11.884049+00:00\",\"@version\":1,\"host\":\"mwdebug1002\",\"message\":\"MainWANObjectCache using store MemcachedPeclBagOStuff\",\"type\":\"mediawiki\",\"channel\":\"objectcache\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"MainWANObjectCache using store {class}\",\"phpversion\":\"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/wiki/Main_Page\",\"ip\":\"10.64.0.46\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":null,\"unique_id\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"class\":\"MemcachedPeclBagOStuff\"}\n", 652, 0, {sa_family=AF_INET, sin_port=htons(10514), sin_addr=inet_addr("127.0.0.1")}, 16) = 652

last log streamed using this socket :

1576165212.057219 sendto(6<socket:[36083338]>, "<15>Dec 12 15:40:12 mwdebug1002 mediawiki: @cee: {\"@timestamp\":\"2019-12-12T15:40:12.057093+00:00\",\"@version\":1,\"host\":\"mwdebug1002\",\"message\":\"Saving all sessions on shutdown\",\"type\":\"mediawiki\",\"channel\":\"session\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"Saving all sessions on shutdown\",\"phpversion\":\"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/wiki/Main_Page\",\"ip\":\"10.64.0.46\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":null,\"unique_id\":\"XfJfWwpAAC4AAF4-WLUAAAAG\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJfWwpAAC4AAF4-WLUAAAAG\"}\n", 586, 0, {sa_family=AF_INET, sin_port=htons(10514), sin_addr=inet_addr("127.0.0.1")}, 16) = 586

also streamed to mwlog1001:

 1576165212.058534 sendto(7<socket:[36083339]>, "XWikimediaDebug 2019-12-12 15:40:12 [XfJfWwpAAC4AAF4-WLUAAAAG] mwdebug1002 enwiki 1.35.0-wmf.8 session DEBUG: Saving all sessions on shutdown  \n", 144, 0, {
sa_family=AF_INET, sin_port=htons(8420), sin_addr=inet_addr("10.64.32.175")}, 16) = 144

closing both sockets:

1576165212.058882 close(7<socket:[36083339]>) = 0
1576165212.058941 close(6<socket:[36083338]>) = 0

and right after that, it raises a fatal error:

1576165212.059214 open("/etc/php/php7-fatal-error.php", O_RDONLY) = 4</etc/php/php7-fatal-error.php> 
1576165212.059278 fstat(4</etc/php/php7-fatal-error.php>, {st_mode=S_IFREG|0444, st_size=4775, ...}) = 0

which it logs

1576165212.060176 sendto(4<socket:[36083385]>, "MediaWiki.errors.fatal:1|c", 26, 0, {sa_family=AF_INET, sin_port=htons(8125), sin_addr=inet_addr("10.64.16.149")}, 16) = 26
1576165212.060351 sendto(5<socket:[35123055]>, "<27>Dec 12 15:40:12 php7.2-fpm: @cee: {\"exception\":{\"message\":\"Uncaught LogicException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore in /srv/mediawiki/php-1.35.0-wmf.8/vendor/monolog/monolog/src/Monolog"..., 1542, MSG_NOSIGNAL, NULL, 0) = 1542

The log that php tries to stream (on mwdebug1002) but can't is:

<15>Dec 12 15:40:12 mwdebug1002 mediawiki: @cee: {"@timestamp":"2019-12-12T15:40:12.059028+00:00","@version":1,"host":"mwdebug1002","message":"Wikimedia\\Rdbms\\LBFactory::destroy: closing connection to database '10.64.32.76'.","type":"mediawiki","channel":"DBConnection","level":"DEBUG","shard":"s1","normalized_message":"Wikimedia\\Rdbms\\LBFactory::destroy: closing connection to database '10.64.32.76'.","phpversion":"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1","url":"/wiki/Main_Page","ip":"10.64.0.46","http_method":"GET","server":"en.wikipedia.org","referrer":null,"unique_id":"XfJfWwpAAC4AAF4-WLUAAAAG","wiki":"enwiki","mwversion":"1.35.0-wmf.8","reqId":"XfJfWwpAAC4AAF4-WLUAAAAG"}

and on mwdebug1001:

<15>Dec 12 16:35:23 mwdebug1001 mediawiki: @cee: {\"@timestamp\":\"2019-12-12T16:35:23.662606+00:00\",\"@version\":1,\"host\":\"mwdebug1001\",\"message\":\"Wikimedia\\\\Rdbms\\\\LBFactory::destroy: closing connection to database '10.64.16.12'.\",\"type\":\"mediawiki\",\"channel\":\"DBConnection\",\"level\":\"DEBUG\",\"shard\":\"s1\",\"normalized_message\":\"Wikimedia\\\\Rdbms\\\\LBFactory::destroy: closing connection to database '10.64.16.12'.\",\"phpversion\":\"7.2.24-1+0~20191026.31+debian9~1.gbpbbacde+wmf1\",\"url\":\"/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector\",\"ip\":\"10.64.32.123\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.org\",\"referrer\":\"https://en.wikipedia.org/wiki/Main_Page\",\"unique_id\":\"XfJsSgpAIHsAAHVV0v4AAAAH\",\"wiki\":\"enwiki\",\"mwversion\":\"1.35.0-wmf.8\",\"reqId\":\"XfJsSgpAIHsAAHVV0v4AAAAH\"}

I am not sure how to continue from here. It appears that both hosts have similar behaviour. Just to be clear, I am quite positive that all logs right before the fatal error, were properly streamed to logstash.

Let me know what you think and how we should proceed

It appears to me that we try to send something on that udp socket with send() after we called close(); This only happens on the debug servers? If so, various things are peculiar there:

  • we install the tideways extension
  • we allow profiling to be active

I would suggest you try to remove tideways from one of the servers to exclude that's what is causing this.

CDanis renamed this task from PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) to All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp).Dec 16 2019, 10:00 PM

Change 559088 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Revert "mwdebug1002: add motd warning"

https://gerrit.wikimedia.org/r/559088

Change 559088 merged by CDanis:
[operations/puppet@production] Revert "mwdebug1002: add motd warning"

https://gerrit.wikimedia.org/r/559088

I also see the UdpSocket warning consistently in the Beta Cluster when using Verbose log.

Debug-only, not a prod failure. Tagging as devprod given the noise on mwdebug is scaring people when staging commits before deployment.

Krinkle renamed this task from All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) to Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed".Jul 13 2021, 12:40 AM
Krinkle added a subscriber: tstarling.

Spotted this on labweb1002 / lab1001 today, all messages referred to 127.0.0.1:10514. A few message emitted at 05:30:02, 06:07:12-14 on both server + a few others here and there. The code comes from:

vendor/monolog/monolog/src/Monolog/Handler/SyslogUdp/UdpSocket.php
55     protected function send(string $chunk): void
56     {
57         if (!is_resource($this->socket) && !$this->socket instanceof Socket) {
58             throw new \RuntimeException('The UdpSocket to '.$this->ip.':'.$this->port.' has been closed and can not be written to anymore');
59         }
60         socket_sendto($this->socket, $chunk, strlen($chunk), $flags = 0, $this->ip, $this->port);
61     }

Maybe the socket has been garbage collected by the OS due to running out of client ports.

I have noticed \\Monolog\Handler\Handler has a __sleep():

vendor/monolog/monolog/src/Monolog/Handler/Handler.php
public function __sleep()
{
    $this->close();

    return array_keys(get_object_vars($this));
}

Thus if the logger is somehow serialized that would close the socket and we cant write logs anymore.

Looking at a couple requests in Kibana the last log message logged before the error is LBFactory shutdown completed. So maybe something attempts to serialize a Monolog Handler which close the socket and thus break logging after that? Does not explain why it is intermittent though. Or maybe we never do any logging after the LBFactory has shutdown unless we have debug logging enabled which cause Wikimedia\\Rdbms\\LBFactory::destroy to attempt a log. That might be the case on the mwdebug and labweb hosts.

That was my fringe theory of the day.

Removing serviceops as this is not actually a production issue and is likely a code bug.

@tstarling wrote a good summary of the issue at T285823:

[…] Probably a message is logged after \Monolog\Handler\SyslogUdp\UdpSocket::close() is called. Perhaps Monolog\Handler\Handler::__destruct() called close(), but the reference is still live, then another destructor attempted to log a message. […]

Whenever X-Wikimedia-Debug: log is used, execution terminates early and an unknown amount of shutdown code is not executed.

The key point being that we're not just losing messages from post-send, but also not executing some or all of the post-send and shutdown steps on these debug requests as any call to logger would result in a fatal. For DeferredUpdates we do recover from this, but it would of course be very spotty to get anything through if any logging call stops that particular stack.

My hunch is that some code is (in)directly calling logger->debug() from a __destruct() method. This would be paradoxical during the "final" shutdown. Even if we allowed Monolog to re-activate a closed connection, we have to stop at some point.

Krinkle renamed this task from Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" to Post-send work someimes fatals with "Errro: The UdpSocket to 127.0.0.1:10514 has been closed" (esp mwdebug hosts).Jul 21 2022, 7:30 PM
Krinkle renamed this task from Post-send work someimes fatals with "Errro: The UdpSocket to 127.0.0.1:10514 has been closed" (esp mwdebug hosts) to Post-send work sometimes fatals with "Errro: The UdpSocket to 127.0.0.1:10514 has been closed" (esp mwdebug hosts).
Krinkle renamed this task from Post-send work sometimes fatals with "Errro: The UdpSocket to 127.0.0.1:10514 has been closed" (esp mwdebug hosts) to Post-send work sometimes fatals at "Error: The UdpSocket to 127.0.0.1:10514 has been closed" (esp. mwdebug hosts).
Krinkle added subscribers: brennen, thcipriani.

My hunch is that some code is (in)directly calling logger->debug() from a __destruct() method. This would be paradoxical during the "final" shutdown. Even if we allowed Monolog to re-activate a closed connection, we have to stop at some point.

I ran /srv/mediawiki.php$ ack -A10 __destruct | fgrep -C20 logger on mwmaint1002 and audited the matches, but did not find much. Most of them were false positives, or things I found unlikely to happen on every web request (such as Shell\Command::__destruct).

Important context for the next bit is that this fatal error is coming from the Monolog's connection to rsyslog, that MediaWiki uses to.... log messages to Logstash. This connection is found to be closed and thus the message can't be sent, which is what the fatal error is for. But... how do we get to see that fatal in Logstash? This is thanks to php-wmerrors, but it has a limited amount of memory to work with and so the stack trace is cut off, which is why we don't know the source of the message.

I applied an ad-hoc edit on mwdebug1001 in order to get a longer trace within the limited amount of space that our last-resort syslog-level error logger permits.

  namespace Monolog\Handler\SyslogUdp;
  class UdpSocket {
     protected function send(string $chunk): void {
-         throw new \RuntimeException('The UdpSocket to '.$this->ip.':'.$this->port.' has been closed and can not be written to anymore');
+         throw new \RuntimeException('The UdpSocket to '.$this->ip.':'.$this->port.' has been closed and can not be written to anymore' . wfGetAllCallers(20));

This produces the following in Logstash when browsing https://test2.wikipedia.org/w/index.php?title=Main_Page&action=history on mwdebug1001 with WikimediaDebug and Verbose logging enabled.

Uncaught RuntimeException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore

Wikimedia\Rdbms\LoadBalancer->__destruct/
Wikimedia\Rdbms\LoadBalancer->disable/
Wikimedia\Rdbms\LoadBalancer->closeAll/
Monolog\Logger->debug/
Monolog\Logger->addRecord/
Monolog\Handler\AbstractProcessingHandler->handle/
Monolog\Handler\SyslogUdpHandler->write/
Monolog\Handler\SyslogUdp\UdpSocket->write/
Monolog\Handler\SyslogUdp\UdpSocket->send
..

Change 816042 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove LBFactory destructor

https://gerrit.wikimedia.org/r/816042

Change 816042 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove LBFactory destructor

https://gerrit.wikimedia.org/r/816042

jbond added subscribers: Joe, jbond.

Removing serviceops as this is not actually a production issue and is likely a code bug.

also removing SRE

There's a possible related issue in T288624: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource, which @tstarling investigated at Gerrit change 957894, and made an upstream PR:

BufferHandler overrides the parent destructor, but it's possible for
close() to be called from a destructor anyway, late during request
shutdown, if the BufferHandler is put in a GroupHandler.

So, avoid flushing the logs twice by ignoring the second call to
close().