SRE: Difference between revisions

Revision as of 16:50, 2 June 2021

TL;DR: If you need help from SRE but do not know which team is responsible for your question, just open a generic task on Phabricator in the SRE project and our Clinic Duty person of the week will route it.

Site Reliability Engineering (SRE) is responsible for availability, performance, monitoring, emergency response, infrastructure security, and capacity planning, plus the maintenance of the software used for those purposes. In many other organizations, this work is handled by an Operations or System Administration team. SRE treats computer operations as a software problem and applies automation wherever possible. The Foundation has a number of subteams within SRE, each responsible for a different area. See SRE Team requests (https://wikitech.wikimedia.org/wiki/SRE_Team_requests) for how to get in touch with those teams, and Wikimedia Site Reliability Engineering (https://www.mediawiki.org/wiki/Wikimedia_Site_Reliability_Engineering) for a more detailed team structure.

SRE Data Center Operations - all things related to Data Centers, hardware maintenance and purchases
SRE Data Persistence - Databases and Object storage (MariaDB and Swift)
SRE Infrastructure Foundations - Automation and Networking (cumin, netbox, puppet, spicerack)
SRE Observability - Monitoring and Logging (Prometheus/Grafana and Elasticsearch, plus some Kafka)
SRE Service Operations - MediaWiki Operations and Supporting Services (Kubernetes, memcached, redis, and infrastructure for GitLab, OTRS, and Phabricator)
SRE Traffic - Caching and DNS (ATS, varnish, GenDNS, wikidough)

References:

How complex systems fail - This is where SRE works
Google's SRE books - Google formalized many of the concepts and coined the term SRE

@@ Line 1: / Line 1: @@
+TL;DR: If you need help from SRE but do not know which team is responsible for your question, just open a generic task on Phabricator in the [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=sre SRE project] and our Clinic Duty person of the week will route it.
+Site Reliability Engineering (SRE) is responsible for availability, performance, monitoring, emergency response, infrastructure security, and capacity planning plus the maintenance of software used for that purpose. This is similar to what in many other organizations is handled by an Operations or System Administration team. SRE treats computer operations as a software problem and applies automation wherever possible. The foundation has a number of subteams with SRE, each responsible for different areas. Check [https://wikitech.wikimedia.org/wiki/SRE_Team_requests here] to see how to get in touch with those teams and [https://www.mediawiki.org/wiki/Wikimedia_Site_Reliability_Engineering here] for a more detailed team structure.
+* [https://wikitech.wikimedia.org/wiki/Dc-operations SRE Data Center Operations] - all things related to Data Centers, hardware maintenance and purchases
-Site Reliability Engineering (SRE) is responsible for availability, performance, monitoring, emergency response, infrastructure security, and capacity planning plus the maintenance of software used for that purpose. This is similar to what in many other organizations is handled by an Operations or System Administration team. SRE treats computer operations as a software problem and applies automation wherever possible. The foundation has a number of subteams with its SRE team. Check [https://wikitech.wikimedia.org/wiki/SRE_Team_requests here] to see how to get in touch with those teams and [https://www.mediawiki.org/wiki/Wikimedia_Site_Reliability_Engineering here] for a more detailed team structure.
+* [https://wikitech.wikimedia.org/wiki/SRE/Data_Persistence SRE Data Persistence] - Databases and Object storage (MariaDB and Swift)
-* [https://wikitech.wikimedia.org/wiki/Dc-operations SRE Data Center Operations]
+* [https://wikitech.wikimedia.org/wiki/Infrastructure_Foundations SRE Infrastructure Foundations] - Automation and Networking (cumin, netbox, puppet, spicerack)
-* [https://wikitech.wikimedia.org/wiki/SRE/Data_Persistence SRE Data Persistence]
+* [https://wikitech.wikimedia.org/wiki/Observability SRE Observability] - Monitoring and Logging (Prometheus/Grafana and ElasticSearch, plus some Kafka)
-* [https://wikitech.wikimedia.org/wiki/Infrastructure_Foundations SRE Infrastructure Foundations]
+* [https://wikitech.wikimedia.org/wiki/SRE/Service_Operations SRE Service Operations] - Mediawiki Operations and Supporting Services (Kubernetes, memcached, redis, Infrastructure for: Gitlab, OTRS, Phabricator)
-* [https://wikitech.wikimedia.org/wiki/Observability SRE Observability]
+* [https://wikitech.wikimedia.org/wiki/Traffic SRE Traffic] - Caching and DNS (ATS, varnish, GenDNS, wikidough)
-* [https://wikitech.wikimedia.org/wiki/SRE/Service_Operations SRE Service Operations]
-* [https://wikitech.wikimedia.org/wiki/Traffic SRE Traffic]