
Database 'INSERT' query rate doubled (module_deps regression?)
Closed, ResolvedPublic

Description

db1083 (s1 master, enwiki) 2020-03-05

db1122 (s2 master, e.g. zhwiki, nlwiki, ..) 2020-03-05
Screenshot 2020-05-18 at 23.17.03.png (1×2 px, 535 KB)

According to @jcrespo, these are largely inserts to the module_deps table (about 150 per second as of 6 hours ago):

INSERT /* Wikimedia\DependencyStore\SqlModuleDependencyStore::storeMulti  */ INTO `module_deps`

20:11 <jynus> not a huge fan of the ON DUPLICATE KEY UPDATE, which I think is what it is hitting instead of inserting
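
For context, a minimal sketch of what such an upsert presumably looks like at the application level. This is illustrative only, not the actual SqlModuleDependencyStore code; it assumes MediaWiki's IDatabase::upsert(), which on MySQL compiles to the INSERT ... ON DUPLICATE KEY UPDATE mentioned above, and the md_module/md_skin/md_deps columns seen later in the binlogs.

// Hypothetical, simplified sketch (not the actual MediaWiki code).
// $dbw is a primary-database handle; column names match the binlog entries.
$dbw->upsert(
    'module_deps',
    [
        'md_module' => $moduleName,   // e.g. 'skins.vector.styles.legacy'
        'md_skin'   => $skinAndLang,  // e.g. 'vector|ru'
        'md_deps'   => $depsJson,     // JSON-encoded list of file paths
    ],
    [ [ 'md_module', 'md_skin' ] ],   // unique key (assumed)
    [ 'md_deps' => $depsJson ]        // what to update on duplicate key
);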

Event Timeline


This changed recently with two commits for T113916. However, AFAIK the queries were meant to be identical, so if this increase was indeed caused by those commits, it is likely unintended.

This went back to normal values, at least in total rate (I haven't checked which kinds of queries are being sent):

Screenshot_20200306_075551.png (900×2 px, 258 KB)

I wonder if some kind of cache invalidation causes this (I was told something about memcached integration), and whether it will happen again. I am a bit ignorant about this functionality, but if that is right, I would consider this rate of writes an avalanche; it would be nice if it could either be pre-warmed or smoothed out over time, but I lack the context to say anything concrete.

Not a high priority, but the change in the pattern of queries was so radical that I was sure it was going to cause an outage (it didn't). The rate may only have increased the predisposition to replication lag.

There was a similar rise in INSERTs (with similar queries involved) on 27-28 February; however, that didn't raise any alarms because the peak rate was lower (around 50% of the most recent one).

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&fullscreen=true&panelId=2&from=1582405443632&to=1586876803912&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104

Looks like the slams still happen each week. I've installed the "Train deploys" annotation on this dashboard and indeed the spikes correlate with the deploys. (Each week typically has 3-4 annotations, where the last one is group2 / all wikis.)

Screenshot 2020-04-14 at 16.08.32.png (1×2 px, 587 KB)
Last 90 days

And before March, the slams weren't there, which further suggests this is a regression, possibly due to the DepStore layer having lost or broken part of the logic.

Screenshot 2020-04-14 at 16.11.13.png (1×2 px, 545 KB)
Last 6 months

Knowing where these come from, and given they are not causing any visible production outage (even if the numbers don't look great), I would not close this ticket, but I would lower the priority so it is tracked as a known issue (a regression towards inefficient storage for this functionality, which would be nice to fix).

Note that, because of our setup, extra inserts get multiplied by the number of replicas on each replica set, so unnecessary extra writes stack up across servers.

Also note that the one on 5 March was the tallest spike; subsequent ones, while not ideal, are less alarming.

These are the handler views, useful because they prove that not only do statements increase, but rows affected (updated or inserted) do too (i.e. this is not SQL that performs no actual row writes):

handler.png (954×2 px, 200 KB)

Krinkle triaged this task as Medium priority. Apr 18 2020, 11:15 PM
Marostegui raised the priority of this task from Medium to High. Edited · May 12 2020, 8:28 AM
Marostegui added a subscriber: brennen.

This happened again after yesterday's release and it is causing lag on some slaves.
The inserts doubled everywhere.

This is s1 (enwiki master)

Captura de pantalla 2020-05-12 a las 10.25.30.png (744×1 px, 207 KB)

And s6 (frwiki, jawiki and ruwiki) master:

Captura de pantalla 2020-05-12 a las 10.26.16.png (721×1 px, 221 KB)

This needs to get fixed as it is now causing lag on certain slaves.

I have confirmed (by looking at the binlogs) that it is the same original INSERT as described when the task was first created:

INSERT /* Wikimedia\DependencyStore\SqlModuleDependencyStore::storeMulti  */ INTO `module_deps` (md_module,md_skin,md_deps) VALUES ('skins.vector.styles.legacy','vector|ru','[\"../php-1.35.0-wmf.30/resources/src/mediawiki.less/mediawiki.mixins.less\",\"../php-1.35.0-wmf.30/resources/src/mediawiki.skinning/i18n-all-lists-margins.less\",\"../php-1.35.0-wmf.30/resources/src/mediawiki.skinning/i18n-headings.less\",\"../php-1.35.0-wmf.
Jdforrester-WMF raised the priority of this task from High to Unbreak Now!. May 12 2020, 3:24 PM
Jdforrester-WMF subscribed.

To recap the state before Manuel's update: this happened before on a previous deploy, but we determined with the Performance Team that it was "bad performance / undesired behaviour" rather than "high priority", as it didn't cause user impact then.

However, this last time it is causing database lag and real production impact on a couple of servers in the s6 section (inserts are seen on all sections), and quite a bit of database error spam (webrequests and jobqueue). :-(

If the root cause is not easy to fix, maybe concurrency can be limited somehow to prevent the lag?

@jcrespo It already has a concurrency limit of 1 update for any given module/wiki/language combination; the others don't get a lock and rely on whoever got the lock first to do the write. The only things we could do beyond that are to turn off production wikis or to let the train reach fewer wikis at once.
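
For illustration, the non-blocking lock pattern described here might look roughly like this (a sketch with assumed key name and composition, using a memcached-backed BagOStuff; not the actual code):

// Illustrative sketch only; key name and components are assumptions.
$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'module-deps-write', $moduleName, $skinAndLang );

// A zero-second timeout makes the lock attempt non-blocking: whoever gets
// the lock does the write, everyone else skips it and relies on the winner.
if ( $cache->lock( $key, 0, 30 ) ) {
    try {
        // ... perform the module_deps write for this combination ...
    } finally {
        $cache->unlock( $key );
    }
}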

I would not describe this as difficult to fix, other than that it is difficult to figure out why it changed in the first place. No requirements or inputs have (intentionally) been changed. So the behaviour we are seeing is definitely not understood or intended. This used to work fine.

So the behaviour we are seeing is definitely not understood or intended. This used to work fine.

Strange indeed; let Manuel or me know if we can help with debugging. Through the binlogs we can dump the whole write workflow.

@Marostegui noticed a similar spike last week and we thought about filing a UBN. But it got dismissed since it had happened the week before and was considered normal behavior.

Is there some kind of cache being invalidated constantly because it is shared by two different wiki versions?

I am the train conductor this week. Should I just roll back 1.35.0-wmf.32, which is currently only on the testwikis, or would it be fine to promote it to all of group0? ;)

The problematic load only happens on group2 wikis due to the scale and diversity of requests, right? If so, we can at least go to group0 for now.

Looking at the graph and SAL:

Since            | Description                                | Until             | Description
2020-04-16 18:45 | 15 mins before: all wikis to 1.35.0-wmf.28 | 2020-04-17 ~18:50 | ??
2020-04-30 12:00 | later at 13:04: all wikis to wmf.30        | 2020-05-01 ~14:00 | ???
2020-05-04 16:00 | all wikis to wmf.30                        | 2020-05-05 17:00  | 17:35 testwikis to wmf.31
2020-05-07 16:00 | ???                                        | 2020-05-08 00:30  | Revert all but testwiki to wmf.30
2020-05-11 17:00 | all wikis to wmf.31                        | 2020-05-12 17:00  | ???

At least the last two events are of some interest. Reverting all wikis but testwiki back to wmf.30 stopped the INSERT spam. Yesterday the spam started when wmf.31 was sent to all wikis, and it stopped after 24 hours.

I am not sure how conclusive those numbers are, though. Might it be related to a 24-hour TTL somewhere?

Given it stopped at 17:00, I will proceed with promoting group0.

I have pushed wmf.32 to group0 at 19:05 UTC. That has not changed the rate of INSERT.

I have pushed wmf.32 to group1 at 19:09 UTC and nothing changed.

Removing from train blockers for now.

However, this last time it is causing database lag and real production impact on a couple of servers in the s6 section (inserts are seen on all sections), and quite a bit of database error spam (webrequests and jobqueue). :-(

I don't know why we went to group1 with this warning, but please do not proceed to group2.

I went with group 1 since:

  • it seems to be triggered by promoting to the rest of the wikis (I guess enwiki?)
  • that has been going on for a few months now
  • it also happens outside of the train window

I have watched the rate after the group 1 promotion and would have rolled back if the rate had risen.

Note that the rate of INSERTs has been rising slowly for the last 12 hours and is at roughly 130 qps right now for db1083.

Background. ResourceLoader stores a dependency tree of module files in the module_deps table. For reasons explained in T113916, we're moving this data away from the main databases to avoid DB master writes from GET requests.

The first step toward that was to move the DB query logic to a separate class, which we could then swap with a different implementation later on. That step was deployed last month and wasn't expected to change behaviour.

We did notice a small increase in INSERTs, but this was considered insignificant. Note that the writes are a one-time thing after a deployment and their number closely relates to how many modules were changed by developers the week prior, as well as when users first request them on any given wiki. It varies from week to week.

By now, we know the change is likely significant, and in looking for a probable cause, the only thing I could find that is different is that we now perform the INSERT queries post-send rather than mid-request (see the sketch after this comment). It's possible that, as a result, more concurrent initial cache-miss requests see stale data on the replica and attempt their INSERT. The bulk of these are prevented by the (non-blocking) Memc lock we have in place for that, but the overlap may not be perfect, and more variance is likely when deferring the write to happen post-send.

The reason there are concurrent HTTP cache misses, by the way, likely has to do with the Varnish-to-ATS migration. Until Nov 2019, the traffic was routed like this: esams:front-cache → esams:back-cache → eqiad:back-cache → eqiad:mw-app; it is now esams:front-cache → esams:back-cache → eqiad:mw-app. This means request coalescing is now distributed, allowing up to 5 concurrent misses rather than just one at a time. 5 isn't much, but it is still 5x more than 1, which for N wikis x N modules x N languages/skins does add up.

Next steps. If we have no other options, we could try to back out the merging and post-send logic. However, I'd recommend we first try to simply disable the module_deps system more generally in favour of the simpler KeyValue mode. Let's evaluate on Monday and determine then whether to roll out KeyValue further, or whether to pay the short-term cost of carefully undoing the post-send refactoring.
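
As a rough illustration of the mid-request vs post-send difference described above (a hypothetical sketch using MediaWiki's DeferredUpdates API; not the actual ResourceLoader code, and $entries is a placeholder for the rows collected during the request):

// Queued post-send, the write runs only after the HTTP response has been
// flushed, which widens the window in which other concurrent cache-miss
// requests still see stale module_deps rows on a replica and queue their
// own INSERT.
DeferredUpdates::addCallableUpdate(
    static function () use ( $entries ) {
        // ... perform the batched module_deps writes collected in $entries ...
    },
    DeferredUpdates::POSTSEND
);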

Change 596739 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/596739
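
A hedged sketch of what this change amounts to (the commit title abbreviates the flag; the constant in MediaWiki core is ILoadBalancer::CONN_TRX_AUTOCOMMIT, and the surrounding call shape here is an assumption): the dependency-store writes get a connection in autocommit mode, so they commit immediately instead of piling into the request's main transaction round.

// Assumed call shape, for illustration only.
$dbw = $lb->getConnectionRef(
    DB_MASTER,
    [],
    false,                               // current (local) database domain
    ILoadBalancer::CONN_TRX_AUTOCOMMIT   // separate autocommit connection
);
// Writes issued on $dbw commit on their own, outside the main transaction.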

Change 596740 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/596740
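
The idea behind this second patch, in a simplified and hypothetical form (variable names and data shapes are assumptions; the actual implementation is in the linked change): read the currently stored md_deps values first and skip rows whose dependency blob has not changed, so identical data no longer triggers an upsert.

// Illustrative sketch only.
$res = $dbw->select(
    'module_deps',
    [ 'md_module', 'md_skin', 'md_deps' ],
    [ 'md_module' => array_keys( $entries ) ],
    __METHOD__
);
$existing = [];
foreach ( $res as $row ) {
    $existing[$row->md_module][$row->md_skin] = $row->md_deps;
}

foreach ( $entries as $module => $depsBySkin ) {
    foreach ( $depsBySkin as $skin => $depsJson ) {
        if ( ( $existing[$module][$skin] ?? null ) === $depsJson ) {
            continue; // unchanged: an upsert here would be a no-op write
        }
        $dbw->upsert(
            'module_deps',
            [ 'md_module' => $module, 'md_skin' => $skin, 'md_deps' => $depsJson ],
            [ [ 'md_module', 'md_skin' ] ],
            [ 'md_deps' => $depsJson ]
        );
    }
}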

Change 597128 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/597128

Change 596739 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/596739

Change 596740 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/596740

Change 597128 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore use CONN_TRX_AUTO

https://gerrit.wikimedia.org/r/597128

Change 597161 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/597161

Change 597161 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.32] resourceloader: make SqlModuleDependencyStore avoid issuing no-op upserts

https://gerrit.wikimedia.org/r/597161

Mentioned in SAL (#wikimedia-operations) [2020-05-18T22:02:38Z] <Krinkle> Clear module_deps on hewiki (group1, s7) to monitor regeneration, ref T247028

Change 597171 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Make DepStore write lock specific to the current wiki

https://gerrit.wikimedia.org/r/597171
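
Roughly, the idea of this patch, extending the earlier lock sketch (assumed key layout, not the actual change): include the current wiki's ID in the write-lock key, so that requests for different wikis never share, or skip their write behind, the same lock entry.

// Illustrative only; lock name and key components are assumptions.
$key = $cache->makeGlobalKey(
    'module-deps-write',
    WikiMap::getCurrentWikiId(),  // e.g. 'hewiki'
    $moduleName,
    $skinAndLang
);
if ( $cache->lock( $key, 0, 30 ) ) {
    // ... perform the write, then $cache->unlock( $key ) ...
}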

Krinkle lowered the priority of this task from Unbreak Now! to High. May 18 2020, 11:04 PM

Considering the train unblocked for now. We've rolled out two patches to minimise the impact of the current (legacy) db-writing model. This should bring the inserts back to where they were, or lower.

Change 597171 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Make DepStore write lock specific to the current wiki

https://gerrit.wikimedia.org/r/597171

Thanks for working on this.
This looks better now, though still not at the values we had a month ago. Let's see if those huge spikes come back.

Captura de pantalla 2020-05-19 a las 6.36.55.png (827×1 px, 184 KB)

I am aiming to push 1.35.0-wmf.32 to all wikis on Monday May 25th ~13:00 UTC (European train window).

Monday 25th was a no-deploy day and I forgot to update this task. 1.35.0-wmf.32 has been rolled out to all remaining wikis today at 13:00 UTC and we have seen the rate of INSERTs doubling again.

Manuel and Jaime were around for the monitoring / DB expertise.

So this time we didn't get as much lag as we did last time; there are still some traces of lag on s8 (wikidatawiki), but definitely not as user-impacting as the previous incident (T247028#6128426).
However, I think it is pure luck that we didn't get more lag this time, as the rate of INSERTs doubled on some masters.
Masters' graphs:

s1- db1083:

Captura de pantalla 2020-05-26 a las 18.11.08.png (725×1 px, 157 KB)

s2 - db1122:
Captura de pantalla 2020-05-26 a las 18.11.53.png (727×1 px, 194 KB)

s3 - db1123:
Captura de pantalla 2020-05-26 a las 18.12.29.png (752×1 px, 172 KB)

s4 - db1138:
Captura de pantalla 2020-05-26 a las 18.13.23.png (717×1 px, 191 KB)

s5 - db1100:
Captura de pantalla 2020-05-26 a las 18.13.53.png (721×1 px, 94 KB)

s6 - db1131:
Captura de pantalla 2020-05-26 a las 18.14.21.png (718×1 px, 90 KB)

s7 - db1086:
Captura de pantalla 2020-05-26 a las 18.14.52.png (722×1 px, 85 KB)

s8 - db1109 (hard to see as it always has very spiky traffic):
Captura de pantalla 2020-05-26 a las 18.14.52.png (722×1 px, 85 KB)

I am not sure how long we can keep deploying with these issues without causing lag, because of these sudden INSERT spikes. Creating lag on the replicas means generating user errors.
We need to work on this and reduce the rate of INSERTs even more, otherwise each train rollout has a high chance of creating lag.

@Marostegui Can you clarify to what extent this is an urgent on-going matter vs something that changed or got worse in 2020?

In any event, the parent task tracks the overall removal of this table (T113916). If we're confident that last week's patches have brought it back to "normal" rate increases around major deployments, I suggest we close this incident and continue at T113916, where there will be ongoing work this/next quarter.

@Marostegui Can you clarify to what extent this is an urgent on-going matter vs something that changed or got worse in 2020?

So the issue is that a week ago we started to see lag because of the increase in spikes, and that is what is worrying. If all we had were these spikes, even though they are not ideal, they would not be causing any harm; but for some reason we saw the slaves falling behind, and that is what causes user impact. The reason is not yet known, but it could be:

  • maybe more INSERTs.
  • maybe more load coming from the natural usage and growth
  • maybe just a one time thing

Or a combination of all of the above.
If this didn't cause lag, we'd be OK with the spikes (obviously keeping in mind that they are not ideal and put unnecessary overhead on the hosts), but as soon as they cause lag it becomes an incident, as users get affected.
Maybe this happened just last week and was a one-time thing, but we definitely need to keep an eye on it.
Luckily today we didn't see it during the train (or at least not as much as last week), but again: was it just luck? When will it happen again? We definitely need to keep monitoring this closely during the rollouts.
As I said above, if this becomes more and more frequent, it means that every rollout will generate errors. I am afraid of this becoming another "it is just normal lag, it happens sometimes during the release", which risks hiding underlying code/performance issues specific to that release.

In any event, the parent task tracks the overall removal of this table (T113916). If we're confident that last week's patches have brought it back to "normal" rate increases around major deployments, I suggest we close this incident and continue at T113916, where there will be ongoing work this/next quarter.

Not sure if we should close this; it is still an issue, as the number of INSERTs does double after the release and it takes hours/days to recover and go back to normal.
This is the situation with the enwiki master almost 6 hours after the release:

Captura de pantalla 2020-05-26 a las 19.15.44.png (718×1 px, 89 KB)

eprodromou subscribed.

@mark We looked this over, and it seems like the right teams are working on this and a lot of hands are on the ball. If there is something specific you need CPT involved in, let us know by tagging us again with the Platform Engineering tag.

What is the state of this now? Are there any query graphs specific to this table?

I don't know, but at least the spikes we used to see each week no longer exist, and this whole thing will become obsolete once we switch to the main stash DB over the next 1-2 months.

From task description
2020-03-05

db1083 (s1 master, enwiki) 2020-03-05

db1122 (s2 master, e.g. zhwiki, nlwiki, ..) 2020-03-05
Screenshot 2020-05-18 at 23.17.03.png (1×2 px, 535 KB)
2020-11-14
a1.png (1×2 px, 147 KB)
b.png (1×2 px, 162 KB)
a2.png (1×2 px, 188 KB)

From my point of view, based on @Krinkle's graphs, we can resolve this.

Is it clear what the underlying problem was and how it was fixed? Just making sure it's not phantom recovery.

We've optimised the queries a fair bit. It likely wasn't a single change that did it, or if it was, we don't know which. We're not putting a lot of effort into this right now, since we're actively investing in moving this feature to a different backend that isn't backed by the core DBs, which makes concerns in this area obsolete.