Incidents/2021-07-16 codfw network

From Wikitech
#REDIRECT[[Incident documentation/2021-07-16 asw-a2-codfw network]]
{{irdoc|status=draft}} <!--
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{irdoc|status=final}}
-->

== Summary ==
At 13:16, the asw-a2-codfw network switch went down and did not recover after remote hands power-cycled it. Services with hosts in the impacted row (such as Swift and various mw-api servers) remained available to clients thanks to automatic failover and load balancing across the remaining hosts. Although mw-api remained available to end users and external clients, the impacted Restbase load balancer remained pooled, so Restbase kept trying (and failing) to reach mw-api hosts. As a result, the mobileapps and cxserver APIs, which rely on Restbase, returned errors to clients for some time.

'''Impact''': For about one hour, the Restbase, mobileapps, and cxserver services served errors.

'''Documentation''':
* <mark>Todo (Link to relevant source code, graphs, or logs)</mark>

== Actionables ==
<mark>Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.</mark>

* [[phab:T286787|T286787]]: Mitigate unresponsive switch (Done).
* [[phab:T286924|T286924]]: LVS should handle losing a NIC on eqiad and codfw.
* [[phab:T286881|T286881]]: Audit eqiad & codfw LVS network links.
* [[phab:T286879|T286879]]: lvs2007, lvs2009, lvs2010 should not be on the same row A switch (Done).
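
The link audit in the actionables above could be approached along these lines. This is only a minimal sketch, not the tooling actually used: the function name <code>find_shared_switches</code>, the <code>(host, switch)</code> inventory format, and the <code>example-switch-*</code> names are all hypothetical, and the sample data is illustrative rather than the real codfw topology.

```python
from collections import defaultdict


def find_shared_switches(links):
    """Return the switches that carry more than one of the given hosts.

    `links` is an iterable of (host, switch) pairs. The result maps each
    shared switch to the sorted list of hosts attached to it.
    """
    hosts_by_switch = defaultdict(set)
    for host, switch in links:
        hosts_by_switch[switch].add(host)
    return {
        switch: sorted(hosts)
        for switch, hosts in hosts_by_switch.items()
        if len(hosts) > 1
    }


# Hypothetical inventory: three load balancers behind one switch,
# i.e. the kind of single point of failure the audit should flag.
shared_inventory = [
    ("lvs2007", "example-switch-1"),
    ("lvs2009", "example-switch-1"),
    ("lvs2010", "example-switch-1"),
]

# Hypothetical inventory after spreading the hosts across switches.
spread_inventory = [
    ("lvs2007", "example-switch-1"),
    ("lvs2009", "example-switch-2"),
    ("lvs2010", "example-switch-3"),
]

print(find_shared_switches(shared_inventory))
print(find_shared_switches(spread_inventory))
```

An empty result means no two of the listed hosts share a switch. A real audit would also need to account for each host's secondary links, which this sketch does not model.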

<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark>

Revision as of 04:48, 18 August 2021