
codfw appserver latency alerts flapping
Closed, Resolved (Public)

Assigned To: jijiki
Authored By: CDanis, May 26 2021, 4:55 PM

Description

For the past several days, the appserver latency alerts for codfw have been flapping:

12:31:13	<icinga-wm>	RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
12:35:13	<icinga-wm>	RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga

This seems to happen to both the appserver and api_appserver clusters.

This isn't really concerning; mostly it is making noise.

The overall request rate in these clusters is very low -- roughly 25 rps each for api_appserver and appserver -- but they recently seem to be getting more 'slow' queries than they have historically.

image.png (306×511 px, 26 KB)

  • Briefly investigate to see if something is amiss with these queries
  • Modify the latency alerts to only fire if there is both high latency and query rps above a certain threshold (see the sketch below)
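
A minimal sketch of that combined condition, assuming a Prometheus-backed check. The URL, metric names, and thresholds below are placeholders; the real change would go into the Icinga/check_prometheus configuration rather than a standalone script:

```
#!/usr/bin/env python3
"""Sketch: only alert when latency is high AND traffic is non-trivial."""
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL

# Hypothetical queries; the actual metric names may differ.
LATENCY_QUERY = (
    'histogram_quantile(0.95, sum by (le) ('
    'rate(mediawiki_http_requests_duration_bucket'
    '{cluster="api_appserver",method="GET"}[5m])))'
)
RPS_QUERY = (
    'sum(rate(mediawiki_http_requests_duration_count'
    '{cluster="api_appserver",method="GET"}[5m]))'
)

LATENCY_THRESHOLD_S = 1.0  # example value
MIN_RPS = 50.0             # example value: below this, latency is too noisy to alert on


def scalar(query: str) -> float:
    """Run an instant query and return the first result as a float."""
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])


def should_alert() -> bool:
    # Fire only when latency is high AND there is enough traffic for the
    # latency estimate to be meaningful.
    return scalar(LATENCY_QUERY) > LATENCY_THRESHOLD_S and scalar(RPS_QUERY) > MIN_RPS


if __name__ == "__main__":
    print("ALERT" if should_alert() else "OK")
```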

Event Timeline

I think there is a correlation between the latency spikes and TCP retransmits; a quick way to sanity-check this follows after the screenshots.

mw latency (select mw2361 in the dashboard)

image.png (333 KB)

TCP errors mw2361
image.png (361 KB)

Zooming in on the 26th of May

image.png (505 KB)

image.png (293 KB)
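
For reference, a quick way to sanity-check this kind of correlation against Prometheus, as a sketch -- the URL, metric names, and instance labels below are assumptions, not the actual codfw prometheus/ops configuration:

```
"""Sketch: Pearson correlation between mw latency and TCP retransmits."""
import requests
import numpy as np

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder URL


def series(query: str, start: str, end: str, step: str = "60s") -> np.ndarray:
    """Fetch a range query and return the sample values as a numpy array."""
    r = requests.get(PROM, params={
        "query": query, "start": start, "end": end, "step": step,
    }, timeout=30)
    r.raise_for_status()
    return np.array([float(v) for _, v in r.json()["data"]["result"][0]["values"]])


start, end = "2021-05-26T00:00:00Z", "2021-05-27T00:00:00Z"
# Hypothetical metric names; adjust to whatever the dashboards actually use.
latency = series('avg(mediawiki_request_duration_seconds{instance="mw2361"})', start, end)
retrans = series('rate(node_netstat_Tcp_RetransSegs{instance="mw2361:9100"}[5m])', start, end)

# Pearson correlation of the two aligned series; a value near 1 would
# support the "retransmits drive latency" hypothesis.
n = min(len(latency), len(retrans))
print(np.corrcoef(latency[:n], retrans[:n])[0, 1])
```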

Marostegui removed a project: SRE.

These times correlate with docker-registry stress tests I was running. My tests ran roughly during these windows:

  • Wed May 19 12:55:54 UTC 2021 -> Wed May 19 16:27:06 UTC 2021
  • Thu May 20 07:27:26 UTC 2021 -> Thu May 20 16:18:17 UTC 2021
  • Fri May 21 12:42:03 UTC 2021 -> Fri May 21 21:07:13 UTC 2021
  • Sat May 22 10:32:56 UTC 2021 -> Sat May 22 10:56:26 UTC 2021
  • Wed May 26 13:08:25 UTC 2021 -> Wed May 26 18:31:25 UTC 2021

@JMeybohm can you add more info about those tests? Are those made from codfw to the docker-registry? (trying to understand how they could fit in the picture)

Sure, sorry:
I'm pulling a docker image (~1.6GB in total size) from the codfw registries onto 1 to 19 codfw Kubernetes nodes in parallel. In the first four timeframes I was doing so without any caching on the registries, which means images were proxied directly from Swift. The last test (yesterday) was with a local nginx cache enabled. See T264209 and https://wikitech.wikimedia.org/wiki/User:JMeybohm/Docker-Registry-Stresstest for reference.
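
For illustration, a minimal sketch of what such a parallel-pull test could look like -- the node names, image reference, and ssh+docker invocation are assumptions; the actual procedure is the one documented on the wikitech page above:

```
"""Sketch: pull one large image on N nodes at once to load the registry."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

IMAGE = "docker-registry.example.org/some/image:tag"  # placeholder, ~1.6GB
NODES = [f"kubernetes20{n:02d}.codfw.example" for n in range(1, 20)]  # placeholder names


def pull(node: str) -> int:
    """Pull the image on a remote node; every layer is fetched from the registry."""
    return subprocess.run(
        ["ssh", node, "docker", "pull", IMAGE],
        capture_output=True,
    ).returncode


# Pulling on 1..19 nodes in parallel stresses the registries (and, when
# registry caching is disabled, the Swift backend behind them).
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    results = list(pool.map(pull, NODES))
print(f"{results.count(0)}/{len(NODES)} pulls succeeded")
```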

jijiki claimed this task.

Unless it turns out @JMeybohm's tests were not the culprit, we can mark this as resolved.