Hi,
Today we switched over most services and traffic caches from the eqiad
(Virginia) datacenter to codfw (Texas) as part of improving our
reliability. The goal is to have this procedure working and regularly
tested in case of an emergency when we actually need it.
We're only aware of one user-facing impact: for a short time, WDQS lag
detection was broken, affecting Wikidata bots that check it. This is
tracked as <https://phabricator.wikimedia.org/T285710>.
Users will experience a bit of a latency increase for now as most user
traffic will need to talk to both eqiad and codfw datacenters. This will
go away tomorrow once MediaWiki is switched over (keep reading).
Also, we were a bit delayed in starting today because of an issue
causing appservers to get stuck:
<https://phabricator.wikimedia.org/T285634>.
== Services ==
Started at 14:29 UTC, officially finished at 15:09.
The main issues we ran into were:
* the helm-charts service is unique and doesn't have a service IP,
causing the automatic switchover verification to break. This required us
to manually check the other services that come after it in the list, and
then re-run the cookbook while excluding it. Tracked as
<https://phabricator.wikimedia.org/T285707>.
* the restbase-async service has some special handling. We debated
whether to follow it and opted not to special-case the service. Figuring
out what to do long-term is tracked as
<https://phabricator.wikimedia.org/T285711>.
* the WDQS issue mentioned earlier.
== Traffic ==
Started at 15:43 UTC, finished at 15:45.
It took until ~16:25 for eqiad to mostly depool. There's not much else
to report; it went very smoothly.
== Tomorrow's MediaWiki switchover ==
Scheduled for 14:00 UTC <https://zonestamp.toolforge.org/1624888854>.
It is our goal to minimize the read-only time and make this a non-event
from a user perspective.
All of the coordination will take place in the #wikimedia-operations IRC
channel on Libera Chat. You're more than welcome to follow along, but if
you have questions, please ask them in #wikimedia-tech so it doesn't get
disruptive. The procedure that we'll be following is documented at
<https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki>.
I'm planning to do one more "live test" later today and will announce it
on IRC when it starts.
-- Kunal