WDQS lag detection required manual adjustment during DC switchover
Closed, Resolved · Public

Description

During today's DC switchover, the WDQS lag checking/monitoring required manual adjustment: https://gerrit.wikimedia.org/r/701927

I'm marking this as high priority because it ended up causing user impact: it affected bots that check the lag before editing.
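(For context on that user impact, here is a minimal, hypothetical sketch of the client-side pattern involved, assuming the standard API maxlag convention. The endpoint, user agent, and error handling are illustrative only, not how any particular bot is implemented.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: a bot sends maxlag=5 with its API requests and backs
// off when the server reports that the lag exceeds that threshold. On Wikidata
// the reported lag also reflects the query service lag, which is why a wrong
// lag value ends up throttling well-behaved bots.
public class MaxlagAwareBot {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // maxlag=5 asks the API to refuse the request when the reported lag is above 5 seconds.
        URI uri = URI.create(
                "https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=5&format=json");
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("User-Agent", "maxlag-sketch/0.1 (example)")
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Crude string check, good enough for a sketch: the API returns an error
        // object with code "maxlag" (plus a Retry-After header) when lag is too high.
        if (response.body().contains("\"code\":\"maxlag\"")) {
            System.out.println("Lag too high, backing off before retrying.");
        } else {
            System.out.println("Lag acceptable, proceeding with the edit.");
        }
    }
}
```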

On IRC @Gehel said that the long-term fix for this is T244590: [Epic] Rework the WDQS updater as an event driven application. How long-term are we looking at? Will that be in place for the switch back in ~1 month? Or do we also need a short term solution?

Event Timeline

Legoktm triaged this task as High priority. Jun 28 2021, 5:50 PM
Legoktm created this task.

It is unlikely that the long term fix (T244590) will be in place for the switch back. The simple workaround (but far from ideal) is to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/701927/ during the switch back. The more complex solution would be to implement better metrics in the current updater, but I doubt we will have time to do that before the switch back.

@Gehel what ends up consuming that value? Can we have it read the primary DC from conftool?

For now I've documented this as a manual step: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&type=revision&diff=1920831&oldid=1920827

The system is passed a topic name at startup that is used as the reference for the lag (https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/tools/src/main/java/org/wikidata/query/rdf/tool/change/KafkaPoller.java#365). The updater reports only the timestamp of this topic to Blazegraph, which is then consumed by a Prometheus exporter.
The way this "timestamp" (lag) is determined should be changed to either:

  • dynamically determine the "reportingTopic" by calling conftool to find out where EventGate is pushing MediaWiki events, or
  • do what the new updater (T244590) does: compute an average of the timestamps across topics instead (see the sketch below).
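To make the second option concrete, here is a minimal, hypothetical sketch of averaging the newest timestamp seen on each subscribed topic using the plain Kafka consumer API. It is a loose illustration of the idea, not the actual updater code or how the streaming updater computes its lag; the broker address, topic names, group id, and the AveragedLagSketch class are all illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Hypothetical sketch: instead of reporting the timestamp of a single,
// DC-specific "reportingTopic", track the newest timestamp seen on every
// subscribed topic and report the average, so the lag measurement does not
// depend on which DC is currently producing the reference topic.
public class AveragedLagSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "wdqs-lag-sketch");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Placeholder per-DC topic names, not the exact topics the updater consumes.
        List<String> topics = List.of(
                "eqiad.mediawiki.revision-create",
                "codfw.mediawiki.revision-create");

        Map<String, Long> newestTimestampPerTopic = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(topics);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Keep the newest record timestamp observed per topic.
                newestTimestampPerTopic.merge(record.topic(), record.timestamp(), Math::max);
            }
        }

        // Average the per-topic timestamps instead of trusting a single topic.
        double avgTimestampMs = newestTimestampPerTopic.values().stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(Instant.now().toEpochMilli());

        long lagMs = Instant.now().toEpochMilli() - (long) avgTimestampMs;
        System.out.println("approximate lag: " + lagMs + " ms");
    }
}
```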

Because none of these changes would be trivial, I think we'd prefer to wait for the new system to be in place.

Fair enough, thanks for the explanation. Could you add whichever task tracks deploying the new updater as a blocker of this one?

dcausse claimed this task.

I think this is resolved now that the streaming updater is in use everywhere? https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/NXBOCI3WKTZBB6RB2GYWBBH2BFH3NBT6/

Yes finally! :)

Does that mean the periodic updateQueryServiceLag should be removed? Currently that’s still running on mwmaint1002.

T221774 was added because WDQS could not keep up with the update rate; the streaming updater was built to make sure WDQS is no longer the bottleneck and that edits made by bots respecting maxLag are no longer throttled.
I think that T221774 is indeed no longer needed. My team plans to define an SLO on the WDQS lag to make sure it's always below 10 minutes (T293027). Is this enough/equivalent to T221774? I've filed T293886 to continue the discussion.