SLO/etcd main cluster
SLO Worksheet - etcd Main
Service
etcd is an open-source key-value store with a focus on reliability, used to store configuration and state data for distributed systems. At WMF we run a number of etcd clusters; this document addresses the two etcd Main clusters, one installed in each of the primary datacenters, eqiad and codfw. A number of applications, including MediaWiki, read and write configuration and state data on etcd.
Teams
etcd is owned by the Service Operations SRE team, which is responsible for all aspects including operation, scalability, backups and software updates. Contact: sre-serviceops@wikimedia.org and https://office.wikimedia.org/wiki/Contact_list#Service_Operations
Architectural
An etcd Main cluster consists of 3 nodes. Each etcd node can answer read requests, but write requests are handled by a single node, the “leader”. If the leader node becomes non-functional, the remaining nodes (the “followers”) elect a new leader, keeping the cluster functional. The election of the new leader is managed via the Raft algorithm. etcd itself replicates data between the nodes, and each node has a complete copy of the data. High availability is provided by etcd out of the box, through a combination of the etcd client and server software.
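As an illustration of the leader/follower split, here is a minimal sketch that queries the etcd v2 stats API on each node to report which one currently holds the leader role. The hostnames and port are hypothetical placeholders; it assumes the nodes expose the standard v2 HTTP endpoints over TLS.

    import requests

    # Hypothetical node names and port; the real cluster members differ.
    NODES = ["etcd1001.example.wmnet", "etcd1002.example.wmnet", "etcd1003.example.wmnet"]
    PORT = 2379

    def cluster_roles(nodes):
        """Return a {node: role} map using the etcd v2 /v2/stats/self endpoint."""
        roles = {}
        for node in nodes:
            try:
                stats = requests.get(
                    f"https://{node}:{PORT}/v2/stats/self", timeout=2
                ).json()
                # etcd v2 reports "StateLeader" or "StateFollower" in the "state" field.
                roles[node] = stats.get("state", "unknown")
            except requests.RequestException:
                roles[node] = "unreachable"
        return roles

    for node, role in cluster_roles(NODES).items():
        print(f"{node}: {role}")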
Hard Dependencies
Etcd is a foundational service and does not have any hard dependencies beyond hardware and networking. It is worth pointing out that server hardware and networking have their own availability figures, typically in the 99% range. Etcd as configured can tolerate certain failures within the local datacenter: with 3 nodes, the cluster keeps working as long as 2 of them (a quorum) remain available.
Soft Dependencies
Etcd is a foundational service and does not have any soft dependencies beyond hardware and networking.
Client-facing
Clients
software | use | connection | interval | failure mode (etcd down)
pybal/LVS | retrieve LB pool server lists, weights and state | custom Python/Twisted, host only | watch | will keep working until restart
varnish/traffic | retrieve list of backend servers | confd | 3 s | will keep working
gdnsd/auth DNS | write admin state files for discovery.wmnet records | confd | 3 s | will keep working
scap/deployment | dsh lists | confd | 60 s | will keep working
redis | replica configuration (changes NOT applied) | confd | 60 s | will keep working
parsoid | use of http or https to connect to the MW API | confd | 60 s | will keep working
MediaWiki | fetch some config variables | php-curl; PHP requests at intervals, cached in APCu | 10 s | will keep working until restart
Icinga servers | update a local cache of the last modified index, used by other checks | cURL | 30 s | the checks will use stale data for comparison
conftool/dbctl/cumin | used to pool/depool hosts or datacenters and to populate data after a puppet-merge | Python requests | N/A | will fail, but that's OK
Confd: a lightweight configuration management daemon focused on keeping local configuration files up to date using data stored in etcd (see Confd on Wikitech).
Host only: the client can only connect to a single host and has no failover capability.
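Most of the clients in the table above poll etcd through confd at a fixed interval; pybal instead keeps a watch open. As a rough sketch of how a watch works against the etcd v2 API, the example below long-polls a key with ?wait=true so the client is notified as soon as the value changes; the endpoint and key path are hypothetical.

    import requests

    # Hypothetical endpoint and key, for illustration only.
    ETCD = "https://etcd.example.wmnet:2379"
    KEY = "/v2/keys/example/pools/appservers"

    def watch(key, index=None):
        """Long-poll a key with the etcd v2 wait API and return the change event."""
        params = {"wait": "true"}
        if index is not None:
            # Resume from a known modifiedIndex so no update is missed.
            params["waitIndex"] = index
        return requests.get(ETCD + key, params=params, timeout=None).json()

    event = watch(KEY)
    node = event["node"]
    print(f"{node['key']} changed (index {node['modifiedIndex']}): {node['value']}")
    # A real watcher would loop, passing node["modifiedIndex"] + 1 as the next waitIndex.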
Request Classes
Reads:
GET/QGET/HEAD requests. These are the requests that only read from the datastore. They make up the bulk of the traffic: over the course of 30 days we see 120-130 million GET requests and 350-400 million Quorum GET (QGET) requests.
Writes:
PUT/DELETE requests. These are sent only by conftool, dbctl and cumin, and are very rare compared to the reads. Over the course of 30 days we count DELETEs in the ballpark of 50-100 and PUTs on the order of 3000-4000.
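To illustrate the difference between the two read classes: in the etcd v2 API a Quorum GET is simply a GET with quorum=true, which forces the read through the Raft leader instead of serving it from a member's possibly stale local copy. The endpoint and key below are hypothetical.

    import requests

    # Hypothetical endpoint and key, for illustration only.
    ETCD = "https://etcd.example.wmnet:2379"
    KEY = "/v2/keys/example/config/setting"

    # Plain GET: may be answered by any member from its local copy of the data.
    plain = requests.get(ETCD + KEY, timeout=2).json()

    # Quorum GET (QGET): quorum=true routes the read through the Raft leader,
    # trading a little latency for an up-to-date (linearizable) answer.
    quorum = requests.get(ETCD + KEY, params={"quorum": "true"}, timeout=2).json()

    print(plain["node"]["value"], quorum["node"]["value"])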
Service Level Indicators (SLI)
SRE looks at three SLIs for etcd: availability, acceptable latency rate and error rate.
We measure over a three-month SLO period that, for reporting reasons, ends one month before the end of the fiscal quarter; i.e. an SLO period covers December, January and February, and so forth.
Availability is determined by monitoring all transactions and is calculated over the SLO period: the difference between all transactions and the number of failed (any error except 404) or slow (i.e. over 32 ms) transactions, divided by all transactions.
Acceptable latency is determined by monitoring all transactions against the etcd store and is calculated over the SLO period: the difference between all transactions and the number of transactions over 32 ms, divided by all transactions.
Error rate is determined by monitoring all transactions against the etcd store and is calculated over the SLO period: the number of failed transactions divided by all transactions.
SLI for GET requests in %:
   100% * (all GET transactions - GET transactions slower than 32 ms) / all GET transactions
SLI for Errors:
    100% * (transactions with an error code >= 500) / all transactions
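As a concrete illustration of the two formulas, the short sketch below computes both SLIs from raw transaction counters; the counter values are made up.

    # Hypothetical counters over one SLO period, for illustration only.
    all_get = 125_000_000     # all GET transactions
    slow_get = 30_000         # GET transactions slower than 32 ms
    all_tx = 500_000_000      # all transactions (reads and writes)
    errors_5xx = 1_200        # transactions with an error code >= 500

    # SLI for GET requests, in %.
    get_sli = 100 * (all_get - slow_get) / all_get

    # SLI for errors, in % (404s are excluded by definition).
    error_sli = 100 * errors_5xx / all_tx

    print(f"GET latency SLI: {get_sli:.4f}%")
    print(f"Error rate:      {error_sli:.6f}%")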
Reference: the Golden Signals include latency, traffic, errors, and saturation.
Operational
Monitoring
Sample values:
Latency: GET p99 around 15 ms
Latency: QGET p99 around 2 ms (that's 50%)
Failures: 0.00 failures per second, out of roughly 300 requests per second
All errors are currently 404s, which are not counted; we only count 5xx errors.
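Counts of this kind can also be cross-checked against etcd's own store statistics. The sketch below reads the per-operation success/failure counters from the etcd v2 /v2/stats/store endpoint; the endpoint hostname is hypothetical. Note that getsFail counts key-not-found lookups (the 404s excluded above), not server errors.

    import requests

    # Hypothetical endpoint, for illustration only.
    ETCD = "https://etcd.example.wmnet:2379"

    stats = requests.get(ETCD + "/v2/stats/store", timeout=2).json()

    # The v2 store stats expose success/fail counters per operation type.
    reads = stats["getsSuccess"] + stats["getsFail"]
    writes = (stats["setsSuccess"] + stats["setsFail"]
              + stats["deleteSuccess"] + stats["deleteFail"])

    print(f"reads:  {reads} (key-not-found: {stats['getsFail']})")
    print(f"writes: {writes} (failed: {stats['setsFail'] + stats['deleteFail']})")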
Troubleshooting
The health of the etcd Main cluster is monitored by Icinga. In case of alerts, follow the troubleshooting procedures linked from the alert message and hosted on Wikitech at Etcd/Main cluster.
Deployment
The service isn’t often deployed. It only gets deployed when doing OS upgrades.
Service Level Objectives
References
The Google SRE book (whole book; 2 copies on the Google SRE Drive).
We do have metrics for etcd; a dashboard now exists.
Examples of past issues: performance degradation due to RAID resyncing starving etcd of needed IOPS; hardware failures.