SLO Worksheet - etcd Main

etcd is an open source key-value store with a focus on reliability that is used to store configuration and state data for distributed systems. At WMF we run a number of etcd clusters; this document addresses the two etcd Main clusters, one installed in each of the primary datacenters, eqiad and codfw. A number of applications, including MediaWiki, read and write configuration and state data on etcd.
An etcd Main cluster consists of 3 nodes. Each of the etcd nodes can answer read requests, but write requests are handled by a single node, the "leader". If the leader node becomes non-functional, the remaining nodes (the "followers") elect a new leader, keeping the cluster functional. The election of the new leader is managed via the Raft algorithm. etcd itself replicates data between the nodes, and each node holds a complete copy of the data. High availability is provided by etcd out of the box, by a combination of the etcd client and server software.
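As a minimal, illustrative sketch of why a 3-node cluster tolerates the loss of exactly one node, the quorum arithmetic behind Raft majorities can be written out as follows (this is not WMF tooling, just the arithmetic):

```python
# Minimal sketch of Raft quorum arithmetic: a cluster stays writable
# as long as a majority of its members can communicate.

def quorum(cluster_size: int) -> int:
    """Number of members that must agree for a write to commit."""
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size: int) -> int:
    """Number of members that can fail while the cluster keeps working."""
    return cluster_size - quorum(cluster_size)

for size in (1, 3, 5):
    print(f"{size} nodes: quorum={quorum(size)}, "
          f"tolerated failures={tolerated_failures(size)}")
# Output: 3 nodes have a quorum of 2 and tolerate 1 failure.
```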
Etcd is a foundational service and does not have any hard dependencies beyond hardware and networking. It is worth pointing out that server hardware and networking have their own failure rates, which are in the 99% availability range. Etcd as configured is able to deal with certain types of failures within a local datacenter.
Etcd is a foundational service and does not have any soft dependencies beyond hardware and networking.
Confd: a lightweight configuration management daemon focused on keeping local configuration files up-to-date using data stored in etcd.
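The pattern confd implements can be sketched roughly as follows. This is a hypothetical illustration using the python-etcd (v2 API) client, not confd's actual implementation (confd itself is written in Go); the key and destination paths are invented placeholders:

```python
# Hypothetical sketch of the confd pattern: watch a key in etcd and
# rewrite a local configuration file whenever its value changes.
# Assumes the python-etcd (v2 API) client; key and file paths are examples.
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)
KEY = "/example/config"        # placeholder key, not a real WMF path
DEST = "/tmp/example.conf"     # placeholder destination file

while True:
    result = client.watch(KEY)  # blocks until the key changes
    with open(DEST, "w") as f:
        f.write(result.value)
    print(f"updated {DEST} from {KEY}")
```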
Host only: the client can only connect to a single host and has no failover capability.
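To illustrate the difference between a host-only client and one that can fail over, here is a hedged sketch using the python-etcd (v2 API) client; the hostnames are placeholders, not real WMF machines:

```python
# Sketch of the two client modes using python-etcd (v2 API).
# Hostnames and ports below are placeholders.
import etcd

# "Host only": one endpoint; no failover if this node goes down.
single = etcd.Client(host="etcd1001.example.org", port=2379)

# Multi-host with reconnect: the client may fail over to another
# cluster member if the current endpoint becomes unreachable.
multi = etcd.Client(
    host=(("etcd1001.example.org", 2379),
          ("etcd1002.example.org", 2379),
          ("etcd1003.example.org", 2379)),
    allow_reconnect=True,
)
```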
GET/QGET/HEAD requests. These requests only read from the datastore. They make up the bulk of the traffic, amounting over the course of 30 days to 120-130 million GET requests and 350-400 million Quorum GET (QGET) requests.
PUT/DELETE requests. These are sent only by conftool, dbctl and cumin. They are very rare compared to the read requests: over the course of 30 days, we count DELETEs in the ballpark of 50-100 and PUTs on the order of 3000-4000.
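A sketch of what these request types look like from a client's point of view, using the python-etcd (v2 API) client; the key is a placeholder, and QGET corresponds to a read with quorum=True:

```python
# Sketch of the request types against etcd v2, via python-etcd.
# The key used here is a placeholder.
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

client.write("/example/key", "value")                 # PUT
plain = client.read("/example/key")                   # GET: may be answered by any node
strong = client.read("/example/key", quorum=True)     # QGET: linearized through the leader
client.delete("/example/key")                         # DELETE
print(plain.value, strong.value)
```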
Service Level Indicators (SLI)
SRE looks at three SLIs for etcd: availability, acceptable latency rate and error rate.
We measure over a three-month SLO period that ends one month before the fiscal quarter does, for reporting reasons; i.e. the SLO period runs December, January, February, and so forth.
Availability is determined by monitoring all transactions and is calculated over the SLO period: the difference between all transactions and the number of failed (all errors except 404) or slow (i.e. over 32 ms) transactions, divided by all transactions.
Acceptable latency is determined by monitoring all transactions against the etcd store and is calculated over the SLO period: the difference between all transactions and the number of transactions over 32 ms, divided by all transactions.
Error rate is determined by monitoring all transactions against the etcd store and calculated over the SLO period: dividing the number of failed transactions by all transactions.
SLI for GET requests in %:
100% * (all GET transactions - GET transactions slower than 32 ms) / all GET transactions
SLI for Errors:
100% * (transactions with an error code >= 500) / all transactions
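For concreteness, here is a worked example of the three calculations over an SLO period. All counts below are made up for illustration, not measured values:

```python
# Worked example of the SLI formulas above, with invented counts.
total = 500_000_000   # all transactions in the SLO period (illustrative)
slow = 600_000        # transactions over 32 ms (illustrative)
failed_5xx = 1_000    # transactions with an error code >= 500; 404s excluded

availability = 100 * (total - (failed_5xx + slow)) / total
latency_sli = 100 * (total - slow) / total
error_rate = 100 * failed_5xx / total

print(f"availability: {availability:.3f}%")  # 99.880%
print(f"latency SLI:  {latency_sli:.3f}%")   # 99.880%
print(f"error rate:   {error_rate:.4f}%")    # 0.0002%
```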
Reference: Golden signals include latency, traffic, errors, and saturation
- Latency: p99 approximately 15 ms for GET
- Latency: p99 approximately 2 ms for QGET (that's 50%)
- Failures: 0.00 failures per second out of about 300 requests per second
At the moment all observed errors are 404s; these are not counted, as we only count 5xx errors.
The health of the etcd Main cluster is monitored by Icinga. In case of alerts, follow the troubleshooting procedures referenced in the alert message, hosted on Wikitech at Etcd/Main cluster.
The service is deployed infrequently: only when doing OS upgrades.
- Request SLO: 99.9% of requests will be successful, resulting in a Request Error Rate Budget of 0.1% of requests.
- Latency SLO: 99.8% of requests will complete in under 32 ms, resulting in a Latency Error Budget of 0.2% of requests.
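Assuming the roughly 300 requests per second noted in the golden-signal figures above, the budgets translate into concrete request counts over a 30-day window; a quick illustrative sketch:

```python
# Rough error-budget arithmetic, assuming a steady ~300 requests/second
# over a 30-day window (traffic figure taken from the failure-rate note above).
RPS = 300
SECONDS_PER_30_DAYS = 30 * 24 * 3600

requests = RPS * SECONDS_PER_30_DAYS     # ~777.6 million requests
request_error_budget = requests * 0.001  # 99.9% SLO -> 0.1% budget
latency_error_budget = requests * 0.002  # 99.8% SLO -> 0.2% budget

print(f"requests per 30 days:  {requests:,}")                  # 777,600,000
print(f"request error budget:  {request_error_budget:,.0f}")   # 777,600
print(f"latency error budget:  {latency_error_budget:,.0f}")   # 1,555,200
```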
We have metrics for etcd; while a dashboard was initially lacking, one now exists.
Examples of past issues: performance degradation due to RAID resyncing starving etcd of needed IOPS.
- AWS EC2 SLA: 90% for a single EC2 VM
- Azure: 95% for a single VM with HDD, 99.5% for a VM with SSD, 99.9% for a Premium VM