Incidents/2021-09-01 partial parsoid outage

document status: draft

Summary

At 02:05 UTC, the parse2007 server in Codfw started to spontaneously respond with fatal errors, possibly due to a php-opcache corruption. At 12:19 UTC, php-fpm on the server was restarted per the (now common) procedure for such corruptions, after which the errors immediately stopped. Other parse servers and MW servers were not affected.

Impact: From 02:05 to 12:20 UTC (about 10 hours), roughly 10% of submissions to Parsoid to parse or save wiki pages failed, on all wikis.

TODO: Measure or determine the impact on user-facing edit save attempts. Are these automatically retried at some level? (Presumably not, given that they are POST requests.) There was seemingly no noticeable change in "edit count", "VisualEditor save rate", or "visualeditoredit API responses".

Timeline

[Figure: Fatal errors in Logstash.]
[Figure: Failing POST requests in Grafana.]


All times in UTC.

  • 02:05: Outage starts. (Determined retroactively.) Logs indicate that parse2007 began emitting "Cannot declare class XWikimediaDebug in XWikimediaDebug.php" fatals at this time, affecting about 10% of POST requests (see the query sketch after this timeline).
  • 03:54: mmodell, as train conductor this week, notices the spike and creates T290120.
  • 12:00: Krinkle finds the report during routine triage, notices that it is still ongoing at high frequency and still unattended, and investigates.
  • 12:10: The issue is investigated and understood to be a likely opcache corruption, given that the source code in question has not changed recently and is not known to be defined or referenced in an unusual manner. The usual procedure for opcache corruptions is to restart php-fpm.
  • 12:19: Effie restarts php-fpm on parse2007
  • 12:20: Outage ends
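
The retroactive numbers above were read off the Logstash and Grafana dashboards. As a rough illustration only, a count query along the lines of the sketch below could reproduce the per-host fatal count; the Elasticsearch endpoint, index pattern, and field names (host, message, @timestamp) are assumptions for this sketch, not the actual Logstash schema.

 """
 Sketch: count the XWikimediaDebug fatals logged for parse2007 during the
 outage window. Endpoint, index pattern, and field names are hypothetical.
 """
 import json
 import urllib.request

 ES_URL = "http://logstash.example.org:9200/logstash-mediawiki-*/_count"  # placeholder

 query = {
     "query": {
         "bool": {
             "filter": [
                 {"term": {"host": "parse2007"}},
                 {"match_phrase": {"message": "Cannot declare class XWikimediaDebug"}},
                 {"range": {"@timestamp": {
                     "gte": "2021-09-01T02:00:00Z",
                     "lte": "2021-09-01T12:20:00Z",
                 }}},
             ]
         }
     }
 }

 req = urllib.request.Request(
     ES_URL,
     data=json.dumps(query).encode("utf-8"),
     headers={"Content-Type": "application/json"},
 )
 with urllib.request.urlopen(req, timeout=10) as resp:
     print("matching fatals:", json.load(resp)["count"])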

Detection

A human reported the error: mmodell noticed the fatal spike while conducting the train and filed task T290120. No automated monitoring picked it up.

Write how the issue was first detected. Was automated monitoring first to detect it? Or a human reporting an error?

Copy the relevant alerts that fired in this section.

Did the appropriate alert(s) fire? Was the alert volume manageable? Did they point to the problem with as much accuracy as possible?

TODO: If human only, an actionable should probably be to "add alerting".

Conclusions

What went well?

  • A similar issue was found a year earlier at T245183#6212267 (and remembered today), which let us jump to the opcache suspicion and thus restart the php-fpm service as mitigation.
  • Logstash dashboards.
  • Grafana dashboards.

What went poorly?

  • For 9 hours, no-one triaged the task, or independently noticed the error spike, or otherwise investigated it.
  • No alerts fired.
    • There is an alert query for appserver fatals, but it did not fire, for two reasons: 1) the alert measures overall HTTP traffic, and the majority of requests are GETs rather than POSTs; 2) the parsoid cluster has a high baseline of timeouts and OOMs that regularly produce fatals. As a result, when new deterministic fatal errors that affect user submissions are introduced, whether by bad code or an unhealthy server, this alert can't fire unless the new error rate rises significantly above the combined rate of timeouts, OOMs, and other GET failures.
    • We also measure "mediawiki-new-errors" in Logstash, where this error stood out by more than 100X. However, we don't have an alert on that.
    • We also measure "appserver HTTP 5xx for POST" in Prometheus/Grafana, where this stood as a 20X increase from <0.1% to >10%. However, we don't alert on that.
  • Krinkle was unable to ssh to parse2007.
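
As a rough illustration of the verb-aware check mentioned above: the sketch below polls the Prometheus HTTP API and compares the parsoid cluster's POST fatal ratio against a threshold. The metric and label names (mediawiki_http_requests_total, cluster, method, code), the Prometheus URL, and the 2% threshold are all assumptions for illustration; a production alert would live in a Prometheus/Alertmanager rule rather than a script.

 """
 Sketch: verb-aware fatal-rate check for the parsoid cluster.
 Metric/label names and the Prometheus endpoint are hypothetical.
 """
 import json
 import urllib.parse
 import urllib.request

 PROMETHEUS_URL = "http://prometheus.example.org/api/v1/query"  # placeholder

 # Ratio of fataled (HTTP 500) POST requests to all POST requests on the
 # parsoid cluster over the last 15 minutes.
 QUERY = (
     'sum(rate(mediawiki_http_requests_total'
     '{cluster="parsoid",method="POST",code="500"}[15m]))'
     ' / '
     'sum(rate(mediawiki_http_requests_total'
     '{cluster="parsoid",method="POST"}[15m]))'
 )

 THRESHOLD = 0.02  # flag if more than 2% of POSTs are failing (arbitrary)


 def post_fatal_ratio() -> float:
     """Return the current POST fatal ratio for the parsoid cluster."""
     url = PROMETHEUS_URL + "?" + urllib.parse.urlencode({"query": QUERY})
     with urllib.request.urlopen(url, timeout=10) as resp:
         data = json.load(resp)
     results = data["data"]["result"]
     if not results:
         return 0.0  # no data; treated as healthy for this sketch
     return float(results[0]["value"][1])


 if __name__ == "__main__":
     ratio = post_fatal_ratio()
     print(f"parsoid POST fatal ratio: {ratio:.2%}",
           "(ALERT)" if ratio > THRESHOLD else "(ok)")

Splitting by verb this way would let a 10% POST failure rate stand out even when GET timeouts and OOMs dominate the overall fatal count.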

Where did we get lucky?

  • Urbanecm was around to offer a workaround for the ssh issue using Keyholder.
  • Effie was around as well and performed the actual restart.
  • (To be confirmed) Are there retries at some level? I suspect not, given that POST retries are usually unsafe, but there is also an apparent lack of observed impact on edit count, Restbase requests, and VisualEditor saves.

How many people were involved in the remediation?

  • 1 engineer.
  • 1 SRE.

Links to relevant documentation

Actionables

  • Let deployers ssh to parse hosts. T290144
  • TODO: Re-evaluate the alerting strategy around "mediawiki-exceptions". We have a breakdown by cluster (web, api, job, parsoid). Do we need a breakdown by HTTP verb (e.g. "read" GET/HEAD/OPTIONS vs "write" POST/DELETE, or some such)?
  • TODO: I was unable to find stats on error rates of api.php requests in Grafana. HTTP-level app server stats are insufficient, since API errors are served as HTTP 200. The Graphite metrics for the API request breakdown don't currently measure errors. The Logstash data for api-appserver errors is also insufficient, since properly handled errors aren't exceptions and aren't logged there as such (e.g. when action=visualeditoredit finds that Restbase/Parsoid responded with HTTP 500, it responds to the client with an error; where do we measure this? See the sketch below.)
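
To illustrate the gap described in the last item: the action API reports most failures as an error object in the response body while still returning HTTP 200, so appserver 5xx rates never move. The sketch below shows where a measurement would have to hook in; it assumes a public test wiki endpoint, and the counter named in the comment is hypothetical.

 """
 Sketch: api.php failures arrive as HTTP 200 with an "error" body, so a
 counter has to be derived from the response body, not the status code.
 """
 import json
 import urllib.parse
 import urllib.request


 def api_request(params: dict) -> dict:
     """Call the action API on a test wiki and return the decoded JSON."""
     base = "https://test.wikipedia.org/w/api.php"
     url = base + "?" + urllib.parse.urlencode({**params, "format": "json"})
     req = urllib.request.Request(url, headers={"User-Agent": "incident-report-sketch/0.1"})
     with urllib.request.urlopen(req, timeout=10) as resp:
         status = resp.status
         body = json.load(resp)
     if status == 200 and "error" in body:
         # This is where a counter would have to be bumped, e.g.
         # metrics.increment("api.soft_error." + body["error"]["code"])  # hypothetical
         print("soft error despite HTTP", status, "->", body["error"].get("code"))
     return body


 if __name__ == "__main__":
     # A deliberately invalid action: returns HTTP 200 with an error body.
     api_request({"action": "no-such-action"})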

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.