
codfw appserver latency alerts flapping
Closed, Resolved (Public)

Assigned To: jijiki
Authored By: CDanis, May 26 2021, 4:55 PM

Description

For the past several days, the appserver latency alerts for codfw have been flapping:

12:31:13	<icinga-wm>	RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
12:35:13	<icinga-wm>	RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga

This seems to happen to both the appserver and api_appserver clusters.

This isn't really concerning; mostly it is making noise.

The overall request rate in these clusters is very low -- roughly 25 rps each for api_appserver and appserver -- but they recently seem to be getting more 'slow' queries than they have historically.

image.png (306×511 px, 26 KB)

  • Briefly investigate to see if something is amiss with these queries
  • Modify the latency alerts to only fire if there is both high latency and query rps above a certain threshold (see the sketch below)
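
A minimal sketch of that combined condition, assuming a Prometheus-backed check. The URL, metric names, and thresholds below are placeholders; the real change would go into the Icinga/check_prometheus configuration rather than a standalone script:

```
#!/usr/bin/env python3
"""Sketch: only alert when latency is high AND traffic is non-trivial."""
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL

# Hypothetical queries; the actual metric names may differ.
LATENCY_QUERY = (
    'histogram_quantile(0.95, sum by (le) ('
    'rate(mediawiki_http_requests_duration_bucket'
    '{cluster="api_appserver",method="GET"}[5m])))'
)
RPS_QUERY = (
    'sum(rate(mediawiki_http_requests_duration_count'
    '{cluster="api_appserver",method="GET"}[5m]))'
)

LATENCY_THRESHOLD_S = 1.0  # example value
MIN_RPS = 50.0             # example value: below this, latency is too noisy to alert on


def scalar(query: str) -> float:
    """Run an instant query and return the first result as a float."""
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])


def should_alert() -> bool:
    # Fire only when latency is high AND there is enough traffic for the
    # latency estimate to be meaningful.
    return scalar(LATENCY_QUERY) > LATENCY_THRESHOLD_S and scalar(RPS_QUERY) > MIN_RPS


if __name__ == "__main__":
    print("ALERT" if should_alert() else "OK")
```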

Event Timeline

I think there is a correlation between the latency spikes and TCP retransmits; a quick way to sanity-check this follows after the screenshots.

mw latency (select mw2361 in the dashboard)

image.png (333 KB)

TCP errors mw2361
image.png (361 KB)

Zooming in on the 26th of May

image.png (505 KB)

image.png (293 KB)
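
For reference, a quick way to sanity-check this kind of correlation against Prometheus, as a sketch -- the URL, metric names, and instance labels below are assumptions, not the actual codfw prometheus/ops configuration:

```
"""Sketch: Pearson correlation between mw latency and TCP retransmits."""
import requests
import numpy as np

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder URL


def series(query: str, start: str, end: str, step: str = "60s") -> np.ndarray:
    """Fetch a range query and return the sample values as a numpy array."""
    r = requests.get(PROM, params={
        "query": query, "start": start, "end": end, "step": step,
    }, timeout=30)
    r.raise_for_status()
    return np.array([float(v) for _, v in r.json()["data"]["result"][0]["values"]])


start, end = "2021-05-26T00:00:00Z", "2021-05-27T00:00:00Z"
# Hypothetical metric names; adjust to whatever the dashboards actually use.
latency = series('avg(mediawiki_request_duration_seconds{instance="mw2361"})', start, end)
retrans = series('rate(node_netstat_Tcp_RetransSegs{instance="mw2361:9100"}[5m])', start, end)

# Pearson correlation of the two aligned series; a value near 1 would
# support the "retransmits drive latency" hypothesis.
n = min(len(latency), len(retrans))
print(np.corrcoef(latency[:n], retrans[:n])[0, 1])
```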

Marostegui removed a project: SRE.

These times correlate with docker-registry stress tests I was running. My tests ran roughly during these windows:

  • Wed May 19 12:55:54 UTC 2021 -> Wed May 19 16:27:06 UTC 2021
  • Thu May 20 07:27:26 UTC 2021 -> Thu May 20 16:18:17 UTC 2021
  • Fri May 21 12:42:03 UTC 2021 -> Fri May 21 21:07:13 UTC 2021
  • Sat May 22 10:32:56 UTC 2021 -> Sat May 22 10:56:26 UTC 2021
  • Wed May 26 13:08:25 UTC 2021 -> Wed May 26 18:31:25 UTC 2021

@JMeybohm can you add more info about those tests? Are those made from codfw to the docker-registry? (trying to understand how they could fit in the picture)

Sure, sorry:
I'm pulling a docker image (~1.6GB in total size) from the codfw registries onto 1 to 19 codfw Kubernetes nodes in parallel. In the first four timeframes I was doing so without any caching on the registries, which means images were proxied directly from Swift. The last test (yesterday) was with a local nginx cache enabled. See T264209 and https://wikitech.wikimedia.org/wiki/User:JMeybohm/Docker-Registry-Stresstest for reference.
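
For illustration, a minimal sketch of what such a parallel-pull test could look like -- the node names, image reference, and ssh+docker invocation are assumptions; the actual procedure is the one documented on the wikitech page above:

```
"""Sketch: pull one large image on N nodes at once to load the registry."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

IMAGE = "docker-registry.example.org/some/image:tag"  # placeholder, ~1.6GB
NODES = [f"kubernetes20{n:02d}.codfw.example" for n in range(1, 20)]  # placeholder names


def pull(node: str) -> int:
    """Pull the image on a remote node; every layer is fetched from the registry."""
    return subprocess.run(
        ["ssh", node, "docker", "pull", IMAGE],
        capture_output=True,
    ).returncode


# Pulling on 1..19 nodes in parallel stresses the registries (and, when
# registry caching is disabled, the Swift backend behind them).
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    results = list(pool.map(pull, NODES))
print(f"{results.count(0)}/{len(NODES)} pulls succeeded")
```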

jijiki claimed this task.

Unless it turns out @JMeybohm's tests were not the culprit, we can mark this as resolved.