WDQS lag detection required manual adjustment during DC switchover
Closed, Resolved · Public

Description

During today's DC switchover, the WDQS lag checking/monitoring required manual adjustment: https://gerrit.wikimedia.org/r/701927

I'm marking this as high priority because it ended up causing user impact: it affected bots that check the lag before editing.
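(For context on that user impact, here is a minimal, hypothetical sketch of the client-side pattern involved, assuming the standard API maxlag convention. The endpoint, user agent, and error handling are illustrative only, not how any particular bot is implemented.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: a bot sends maxlag=5 with its API requests and backs
// off when the server reports that the lag exceeds that threshold. On Wikidata
// the reported lag also reflects the query service lag, which is why a wrong
// lag value ends up throttling well-behaved bots.
public class MaxlagAwareBot {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // maxlag=5 asks the API to refuse the request when the reported lag is above 5 seconds.
        URI uri = URI.create(
                "https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=5&format=json");
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("User-Agent", "maxlag-sketch/0.1 (example)")
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Crude string check, good enough for a sketch: the API returns an error
        // object with code "maxlag" (plus a Retry-After header) when lag is too high.
        if (response.body().contains("\"code\":\"maxlag\"")) {
            System.out.println("Lag too high, backing off before retrying.");
        } else {
            System.out.println("Lag acceptable, proceeding with the edit.");
        }
    }
}
```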

On IRC @Gehel said that the long-term fix for this is T244590: [Epic] Rework the WDQS updater as an event driven application. How long-term are we looking at? Will that be in place for the switch back in ~1 month? Or do we also need a short term solution?

Event Timeline

Legoktm triaged this task as High priority. Jun 28 2021, 5:50 PM
Legoktm created this task.

It is unlikely that the long term fix (T244590) will be in place for the switch back. The simple workaround (but far from ideal) is to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/701927/ during the switch back. The more complex solution would be to implement better metrics in the current updater, but I doubt we will have time to do that before the switch back.

@Gehel what ends up consuming that value? Can we have it read the primary DC from conftool?

For now I've documented this as a manual step: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&type=revision&diff=1920831&oldid=1920827

The system is passed a topic name at startup that is used as the reference for the lag (https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/tools/src/main/java/org/wikidata/query/rdf/tool/change/KafkaPoller.java#365). The updater reports only the timestamp of this topic to Blazegraph, which is then consumed by a Prometheus exporter.
The way this "timestamp" (lag) is determined should be changed to either:

  • dynamically determine the "reportingTopic" by calling conftool to find out where EventGate is pushing MediaWiki events, or
  • do what the new updater (T244590) does: compute an average of the timestamps across topics instead (see the sketch below).
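To make the second option concrete, here is a minimal, hypothetical sketch of averaging the newest timestamp seen on each subscribed topic using the plain Kafka consumer API. It is a loose illustration of the idea, not the actual updater code or how the streaming updater computes its lag; the broker address, topic names, group id, and the AveragedLagSketch class are all illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Hypothetical sketch: instead of reporting the timestamp of a single,
// DC-specific "reportingTopic", track the newest timestamp seen on every
// subscribed topic and report the average, so the lag measurement does not
// depend on which DC is currently producing the reference topic.
public class AveragedLagSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "wdqs-lag-sketch");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Placeholder per-DC topic names, not the exact topics the updater consumes.
        List<String> topics = List.of(
                "eqiad.mediawiki.revision-create",
                "codfw.mediawiki.revision-create");

        Map<String, Long> newestTimestampPerTopic = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(topics);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Keep the newest record timestamp observed per topic.
                newestTimestampPerTopic.merge(record.topic(), record.timestamp(), Math::max);
            }
        }

        // Average the per-topic timestamps instead of trusting a single topic.
        double avgTimestampMs = newestTimestampPerTopic.values().stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(Instant.now().toEpochMilli());

        long lagMs = Instant.now().toEpochMilli() - (long) avgTimestampMs;
        System.out.println("approximate lag: " + lagMs + " ms");
    }
}
```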

Because none of these changes would be trivial, I think we'd prefer to wait for the new system to be in place.

Fair enough, thanks for the explanation. Could you add whichever task tracks deploying the new updater as a blocker of this one?

dcausse claimed this task.

I think this is resolved now that the streaming updater is in use everywhere? https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/NXBOCI3WKTZBB6RB2GYWBBH2BFH3NBT6/

Yes finally! :)

Does that mean the periodic updateQueryServiceLag should be removed? Currently that’s still running on mwmaint1002.

T221774 was added because WDQS could not keep up with the update rate; the streaming updater was built to make sure WDQS is no longer the bottleneck and that edits made by bots respecting maxLag are no longer throttled.
I think that T221774 is indeed no longer needed. My team plans to define an SLO on the WDQS lag to make sure it's always below 10 minutes (T293027). Is this enough/equivalent to T221774? I've filed T293886 to continue the discussion.