June 2021 Datacenter Switchover
In June 2021, most user traffic was switched from our primary Virginia datacenter to our secondary one in Texas. This post covers how the switchover went and the issues that came up.
By Kunal Mehta, Site Reliability Engineer, Service Operations
The main reason we perform a datacenter switchover is to verify that in an emergency, we can switch to a different datacenter with minimal interruptions for users. All of our services and datacenters have redundant networking, power, disks, and more. Even then, freak accidents can happen, and we need to be prepared.
We also used this time to perform maintenance in Virginia that’s cumbersome to do when we’re actively serving user traffic. For example, we’re currently swapping out about 45 MediaWiki application servers for brand new hardware, giving users a slight performance boost. There’s also a large list of pending database maintenance that was waiting for the switchover to happen.
The switchover itself was divided into three primary sections: Services, Traffic (caches), and MediaWiki.
At one point in time, MediaWiki was a large PHP application, but years ago, we started deconstructing it into a set of smaller services. Today, we have MediaWiki, which is still a large PHP application, and many services that provide some independent function to MediaWiki, such as maps, or math syntax, or even the WikiText parsing itself. For each switchover, we try to expand the list of services being switched. This time we included two more services in this list, notably Swift, which handles all of our media storage.
Most of these are active-active, in that they run out of both datacenters at the same time. Under normal circumstances, we choose to use these in the same datacenter as MediaWiki. During the switchover, we moved usage to Texas to ensure we have enough capacity there to handle the load. One example is the Citoid service, which fetches and generates reference templates and metadata: its traffic shifted from Virginia to Texas.
During this process we identified a few issues:
- T285707: Our helm-charts service doesn’t have a service IP, causing it to fail verification that it switched over properly. This also interrupted the verification for the rest of the services, so we had to check them by hand.
- T285710: Monitoring for the Wikidata Query Service required manually switching the datacenter being monitored, causing lag to be misreported. Most Wikidata bots do check the amount of lag before editing, so they were stalled until it was manually switched.
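The effect described in T285710 can be illustrated with a minimal sketch of how a lag-aware bot might behave. The function names, the 5-second threshold, and the sample lag values below are all assumptions chosen for illustration; this does not call any real API and is not how any particular bot is implemented.

```python
def seconds_to_wait(reported_lag, max_lag=5.0, retry_after=5.0):
    """Return 0 if the bot may edit now, else how long to back off.

    `max_lag` and `retry_after` are hypothetical defaults.
    """
    return 0.0 if reported_lag <= max_lag else retry_after


def run_bot(lag_samples, max_lag=5.0):
    """Count edits made and stalls over a sequence of reported lag values."""
    edits = stalls = 0
    for lag in lag_samples:
        if seconds_to_wait(lag, max_lag) == 0:
            edits += 1   # lag is acceptable, the bot proceeds
        else:
            stalls += 1  # lag looks too high, the bot backs off
    return edits, stalls
```

If monitoring points at the wrong datacenter and reports a large lag, a bot like this stalls even though the service it actually depends on is healthy.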
Most requests for articles never hit MediaWiki itself. They’re served from our edge caches, typically the one closest to you: Virginia, Texas, California, Amsterdam, or Singapore. We disconnected Virginia by excluding it from our geographic DNS, where all countries are mapped to datacenters, and within a few minutes, nearly all of that traffic was going to Texas instead.
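As a rough sketch of the idea, a geographic DNS map can be modeled as an ordered list of preferred datacenters per country, with excluded sites skipped. The country-to-datacenter preferences and the `resolve` function below are hypothetical; the real mapping is not shown in this post.

```python
# Hypothetical, simplified geographic DNS map: country code -> datacenters
# in order of preference. Only the site names come from this post.
GEO_MAP = {
    "US": ["virginia", "texas", "california"],
    "BR": ["virginia", "texas"],
    "NL": ["amsterdam", "virginia", "texas"],
    "SG": ["singapore", "california", "texas"],
}


def resolve(country, depooled=frozenset()):
    """Return the first preferred datacenter that is not depooled."""
    for dc in GEO_MAP[country]:
        if dc not in depooled:
            return dc
    raise LookupError(f"no pooled datacenter for {country}")
```

With this model, `resolve("US")` returns `"virginia"`, while `resolve("US", {"virginia"})` falls through to `"texas"`. Countries whose first choice is another site, such as `"NL"` here, are unaffected by the exclusion.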
We didn’t run into any issues during this step.
MediaWiki is the application that powers all of our wikis. Work is ongoing to make it possible to run it in multiple datacenters at the same time, but for now, it can only be active in one at a time. The process for switching datacenters for MediaWiki is complex, but in brief entails setting the primary databases as read-only, waiting for replication to finish across into the other datacenter, and then lifting read-only mode in the new datacenter.
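The three steps in that summary can be sketched with a toy model. The `Database` class, its `position` counter, and `switch_primary` are hypothetical stand-ins, not the actual switchover automation, which involves many more steps than shown here.

```python
import time
from dataclasses import dataclass


@dataclass
class Database:
    """Toy stand-in for a primary database in one datacenter."""
    name: str
    read_only: bool = False
    position: int = 0  # index of the last write applied


def switch_primary(old, new, poll=lambda: None):
    """Move writes from `old` to `new`; return seconds spent read-only."""
    started = time.monotonic()
    old.read_only = True                 # 1. stop accepting edits
    while new.position < old.position:   # 2. wait for replication to catch up
        poll()                           #    (a real loop would sleep and re-check)
    new.read_only = False                # 3. accept edits in the new datacenter
    return time.monotonic() - started
```

The returned duration corresponds to the read-only window that users experience, which is why step 2 needs to finish as quickly as possible.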
Because of how disruptive stopping edits is for wikis, we’ve been cutting down how long this read-only period takes with each switchover. This time, it only lasted 1 minute and 57 seconds, the fastest yet!
After the switch, the Turkish Wikivoyage was unavailable for a few minutes because of a typo in the configuration. An incident report was written for this, and a patch is pending review to prevent it from happening again.
Various other improvements to the automation around switching have been filed in Phabricator.
We will switch back to our primary Virginia datacenter sometime in August once most maintenance has finished, allowing us to test the procedure once again. We also have Datacenter-Switchover Phabricator projects tracking our work in this area to make Wikimedia wikis more resilient and available on a technical level.