Etcd/Main cluster

The main etcd cluster is the Etcd cluster used as a state management system for the WMF production cluster. It is operated by SRE Service Ops under the etcd main cluster SLO.

Usage in production

More and more systems depend on etcd for retrieving state information. All current uses are listed in the table below.

software | use | connection | interval | failure mode
pybal/LVS | retrieve LB pool server lists, weights, state | custom python/twisted, host only | watch | will keep working until restart
varnish/traffic | retrieve list of backend servers; retrieve VCL fragments (requestctl) | confd (watch) | watch | will keep working
gdnsd/auth dns | write admin state files for discovery.wmnet records | confd | watch | will keep working
scap/deployment | dsh lists | confd | 60 s | will keep working
MediaWiki | fetch some config variables | PHP connection, request at intervals | 10 s | will keep working until restart
Icinga servers | update a local cache of the last modified index to be used by other checks | cURL | 30 s | the checks will use stale data for comparison

In a failure, all systems will become unable to modify any configuration they derive from etcd, but they will keep working. Only a subset of them will survive a service restart, though.
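
As a concrete illustration of what these consumers read, the conftool state lives under the /conftool prefix and can be inspected with a plain query against the etcd v2 keys API. Below is a minimal sketch, run on a conf host against the local unauthenticated client port described under "Individual cluster configuration"; the exact layout of the keys under the prefix is illustrative, not authoritative.

# Dump the conftool tree recursively; the v2 keys API returns JSON,
# pretty-print it with python3 (or jq) to see the pooled/weight objects
curl -s 'http://127.0.0.1:2378/v2/keys/conftool?recursive=true' | python3 -m json.tool | head -n 40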

Architecture

The main cluster is composed of two separate sub-clusters, "codfw.wmnet" and "eqiad.wmnet" (creatively named after the datacenters they're located in), which are not connected via RAFT consensus but via replication, so that there is always a master cluster and a slave one.

Consistency

For reads that don't require sub-second consistency cluster-wide, reading from the slave cluster is acceptable. If replication breaks, this will page opsens, who will be able to correct the issue quickly enough (worst case scenario, by pointing clients to the master DC). All writes should go to the master datacenter; we ensure that the slave cluster is in read-only mode for remote clients to avoid issues.

Replication

Replication works using etcdmirror, a fairly bare-bones piece of software we wrote internally that replicates from one cluster to another while mangling key prefixes. It is meant to give etcd 2 clusters the functionality that etcdctl mirror-maker provides on etcd 3.

Etcdmirror runs from one machine on the slave cluster: it reads the etcd index to replicate from in /__replication/$destination_prefix, issues a recursive watch request to the source cluster starting at the recorded index, and then recursively replicates every write that happens under $source_prefix in the source cluster. Since we're (at the moment) only interested in the /conftool directory, that's what we replicate between the two clusters. Logs from the application are usually pretty telling about what is going wrong.
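
To make the mechanism more tangible, the same two primitives can be exercised by hand with the etcd v2 HTTP API. This is only a sketch of what etcdmirror does internally, not its actual invocation; it assumes the destination prefix is conftool, that the commands are run on the slave host carrying the replica, and that credentials are added for the nginx-proxied endpoint if required.

# 1. Read the index to resume from, stored on the destination (slave) cluster
curl -s 'http://127.0.0.1:2378/v2/keys/__replication/conftool'
# 2. Long-poll the source (master) cluster for the next change under the
#    replicated prefix, starting right after that index (<INDEX+1> is a placeholder)
curl -s 'https://conf1001.eqiad.wmnet:2379/v2/keys/conftool?wait=true&recursive=true&waitIndex=<INDEX+1>'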

The replica daemon is very strict and will fail as soon as any inconsistency is found (even just in the original value of a key) or if the lag is large enough that we're losing etcd events. In such a case you will need to do a full reload, which means launching etcdmirror with the --reload switch. Beware: doing so will ERASE ALL DATA on the destination cluster, so do it with extreme caution.

Individual cluster configuration

We decided to proxy external connections to etcd via an nginx proxy that handles TLS and HTTP authentication and should be fully compliant with etcd's own behaviour. The reasons for this are that the built-in authentication imposes a severe performance hit on etcd, and that our TLS configuration for nginx is much better than what etcd itself offers.

It also gives us the ability to switch the read-only status of a cluster on and off by flipping a switch in puppet. I don't know of any way to do this with the standard etcd mechanism without actually removing users and/or roles, a slow process that is hard to automate/puppetize.

So what happens is that on every host we have an etcd instance listening for client connections on http://127.0.0.1:2378 with no authentication, so local clients can write to it unauthenticated. It does, however, advertise https://$fqdn:2379 as its client URL, which is where nginx is listening for external connections and enforcing authentication.
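
The practical consequence for clients is that the same data is reachable on two endpoints with different guarantees. Here is a minimal sketch to see the difference from a conf host; the credentials for the nginx endpoint are placeholders, and on the slave cluster only the external endpoint is forced read-only.

# Local plaintext endpoint: no TLS, no authentication, writable by local clients
curl -sf 'http://127.0.0.1:2378/v2/keys/conftool?recursive=true' >/dev/null && echo local-ok
# External endpoint: TLS terminated by nginx, HTTP authentication enforced
curl -sf --user 'USER:PASSWORD' "https://$(hostname -f):2379/v2/keys/conftool?recursive=true" >/dev/null && echo external-ok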

Operations

For the most part, you can refer to what is written in Etcd, but there are a few more operations regarding replication that are not covered there.

Master cluster switchover

From https://phabricator.wikimedia.org/T166552

Play-by-play:

These instructions assume the primary cluster is currently codfw and moving to eqiad. When moving in the opposite direction, swap the data centers accordingly in each step.

  1. Reduce the TTL for conftool SRV records to 10 seconds
  2. On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
  3. Start read-only in the dc we're switching from  (https://gerrit.wikimedia.org/r/356138)
  4. sudo cumin A:conf-codfw 'run-puppet-agent' (begins read-only)
  5. Verify that etcd is read-only by attempting to depool a server with conftool; it should fail.
  6. To avoid being paged for etcdmirror replication delay, visit icinga and downtime the "Etcd replication lag #page" service.
  7. sudo cumin A:conf 'disable-puppet "etcd replication switchover"'
  8. Stop replication in the dc we're switching to (https://gerrit.wikimedia.org/r/#/c/356139)
  9. sudo cumin 'A:conf-eqiad' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
  10. Switch the conftool SRV record for read-write access to the dc we're switching to, updating the port if necessary (https://gerrit.wikimedia.org/r/#/c/356136/)
  11. On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
  12. sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
  13. sudo cumin A:conf-codfw 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
  14. Set the dc we're switching to as read-write (https://gerrit.wikimedia.org/r/356341)
  15. sudo cumin A:conf-eqiad 'run-puppet-agent' (ends read-only)
  16. Verify that etcd is read-write again by depooling and repooling a server with conftool; this time it should succeed (see the verification sketch after this list).
  17. Verify that etcdmirror is replicating correctly by tailing /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log in codfw; you should see updates corresponding to the depool and repool in the last step.
  18. Restore the TTL to 5 minutes https://gerrit.wikimedia.org/r/#/c/356137/
  19. On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
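
For steps 5, 16 and 17 above, the checks can be run from a cluster management host. Below is a minimal sketch; the target host name is a placeholder (pick something you can safely depool), and the SRV record name for steps 1/18 is indicative, so double check it against the conftool client configuration.

# Steps 5/16: flip a pooled state with conftool; this must fail while the
# cluster is read-only and succeed once it is read-write again
sudo confctl select 'name=mw1234.eqiad.wmnet' set/pooled=no
sudo confctl select 'name=mw1234.eqiad.wmnet' set/pooled=yes
# Step 17: watch the replication log on the new slave side (codfw here)
tail -f /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log
# Steps 1/18: confirm the SRV record TTL change has propagated
dig +noall +answer SRV _etcd._tcp.eqiad.wmnet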


Reimage cluster

Steps to reimage a conf cluster, step by step (in this case the conf2 cluster in codfw, using conf2004 as the example).

Be aware that this might not reflect current reality! LVS hosts might have changed, their role might have changed ... actually everything might have changed. So please use this as a starting template and DOUBLE CHECK EVERY STEP BEFORE YOU EVEN START!
  • Change SRV client record to point to the other cluster gerrit
  • Update authdns
ssh dns1004.wikimedia.org "sudo -i authdns-update"
  • Restart all confd instances and navtiming to pick up the new DNS records
# batched restart of confd 
sudo cumin -b 50 -s 20 'C:confd' 'systemctl restart confd' 
# and navtiming 
sudo cumin webperf2003.codfw.wmnet 'systemctl restart navtiming.service'
  • Make Pybal use the other cluster gerrit
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'run-puppet-agent' 
# check lvs config 
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'grep conf1 /etc/pybal/pybal.conf || true' 
 
# LOG TO SAL 
# restart pybal on secondaries 
sudo cumin 'lvs2014.codfw.wmnet,lvs5006.eqsin.wmnet,lvs4010.ulsfo.wmnet' 'systemctl restart pybal' 
 
# LOG TO SAL 
#restart pybal on primaries 
sudo cumin -b 1 -s 5 'lvs201[1-3].codfw.wmnet,lvs500[4-5].eqsin.wmnet,lvs400[8-9].ulsfo.wmnet' 'systemctl restart pybal'
  • Ensure nothing uses etcd on conf2*
    • Check /var/log/nginx/etc*access.log
    • Check ss -apn | grep 4001
  • Reimage
# reimage 
sudo cookbook sre.hosts.reimage --os bullseye -t T332010 conf2004
  • Delete and re-add the etcd member from the etcd cluster (see the member ID sketch after this list)
# Get the member-id of conf2004 
etcdctl -C https://$(hostname -f):2379 member list 
# Using etcdctl does not work here because it will use tcp/4001 as the client port, which blocks writing to /v2/members
curl -X DELETE https://$(hostname -f):2379/v2/members/<MEMBER-ID> 
curl -X POST https://$(hostname -f):2379/v2/members -H "Content-Type: application/json" -d '{"peerURLs":["https://conf2004.codfw.wmnet:2380"]}'
  • Restart etcd on the reimaged host with ETCD_INITIAL_CLUSTER_STATE="existing"
systemctl stop etcd 
source /etc/default/etcd 
rm -rf ${ETCD_DATA_DIR}/* 
sed -i 's/ETCD_INITIAL_CLUSTER_STATE.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' /etc/default/etcd
systemctl start etcd 
run-puppet-agent
  • Ensure etcd and zookeeper are happy before going to the next one
echo ruok | nc localhost 2181; echo; echo stats | nc localhost 2181; echo; etcdctl -C https://$(hostname -f):2379 cluster-health
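
As referenced in the member step above, the member ID can also be extracted programmatically instead of reading it off the member list output. A minimal sketch, run on one of the remaining cluster members, using the local unauthenticated port and jq; the member name matches the host being reimaged.

# List cluster members as JSON and extract the ID for conf2004
curl -s 'http://127.0.0.1:2378/v2/members' | jq -r '.members[] | select(.name=="conf2004") | .id'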

See also