Incidents/2021-07-16 codfw network: Difference between revisions
Created page with "{{irdoc|status=draft}} <!-- The status field should be one of: * {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review. * {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on Incident documentation. * {{irdoc|status=final}} --> == Summary == At 13:16, the asw-a2-codfw networking switch went down and did not return after we had remo..." |
Redirected page to Incident documentation/2021-07-16 asw-a2-codfw network Tag: New redirect |
||
Line 1: | Line 1: | ||
#REDIRECT[[Incident documentation/2021-07-16 asw-a2-codfw network]] |
|||
{{irdoc|status=draft}} <!-- |
|||
The status field should be one of: |
|||
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review. |
|||
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]]. |
|||
* {{irdoc|status=final}} |
|||
--> |
|||
== Summary == |
|||
At 13:16, the asw-a2-codfw network switch went down and did not come back after remote hands power-cycled it. Services with hosts in the impacted row (such as Swift and various mw-api servers) remained available to clients thanks to automatic failover and load balancing to the remaining hosts. Although mw-api remained available to end users and external clients, the impacted Restbase load balancer remained pooled, so Restbase kept trying (and failing) to reach mw-api hosts. As a result, the mobileapps API and cxserver API, which rely on Restbase, returned errors to clients for some time. |
|||
'''Impact''': For about 1 hour, the Restbase, mobileapps, and cxserver services served errors. |
|||
'''Documentation''': |
|||
* <mark>Todo (Link to relevant source code, graphs, or logs)</mark> |
|||
== Actionables == |
|||
<mark>Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.</mark> |
|||
* [[phab:T286787|T286787]]: Mitigate unresponsive switch (Done). |
|||
* [[phab:T286924|T286924]]: LVS should handle losing a NIC on eqiad and codfw. |
|||
* [[phab:T286881|T286881]]: Audit eqiad & codfw LVS network links. |
|||
* [[phab:T286879|T286879]]: lvs2007, lvs2009, lvs2010 should not be on the same row A switch (Done). |
|||
<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark> |