MediaWiki Engineering/Guides/Monitor production errors

From Wikitech
Revision as of 16:45, 28 April 2022 by Brennen Bearnes (talk | contribs) (→‎Create an OpenSearch filter: Add query DSL, add formatting to some instructions.)

This guide explains how to use Kibana (Logstash) to monitor production errors and exceptions and report them to Phabricator.

The exception triaging process can start in either Phabricator or Kibana. Phabricator holds tasks already filed by others that may need reviewing or validation. The Kibana dashboard can be used to triage new issues and filter out known ones.

Any production exception reported will end up in the Untriaged column of the Wikimedia-production-error workboard in Phabricator.

Overview

Terminology

ELK stack

  • Elasticsearch: Where log events are stored.
  • Logstash: Does processing and injection into Elasticsearch.
  • Kibana: User interface that reads from Elasticsearch. Lives at https://logstash.wikimedia.org

MediaWiki/PHP pushes messages into Logstash, which stores them in Elasticsearch, where they can be viewed via Kibana.

Kibana

There are two main dashboards to keep track of:

  • mediawiki-errors: This dashboard is used to keep track of errors encountered on hosted WMF wikis. It contains logs for errors and exceptions such as runtime errors, logic errors, memory limits exceeded, timeouts, etc.
  • mediawiki-new-errors: This dashboard is a copy of mediawiki-errors, with filters applied to remove errors that already have Phabricator tasks associated with them, and are therefore already reported.

Filters

You can manipulate your view of the mediawiki-new-errors dashboard:

  • Temporarily disable a filter by hovering over it and clicking the checkbox icon.
  • Delete a filter by clicking the trashcan icon. Be cautious when deleting: once saved, the change affects everyone’s view, so delete a filter only when the error is obsolete or fixed.

The mediawiki/tools/release repository contains scripts, in its check-new-error-tasks directory, for checking the status of filters that mention Phabricator tasks.

Logs

Log events have multiple attributes such as channel, reqId, url, wiki, etc. Note that logs are purged after 90 days in line with the Foundation's Privacy Policy. A few notable attributes are:

  • mwversion: MediaWiki version and wmf branch name. This is useful to determine if an error is in the new code still riding the train, or in fully deployed code.
  • channel: Indicates where the error originated.
    • exception: Fatal exception (any uncaught exception, or timeout, out of memory, etc.).
    • error: Native PHP silent error (such as undefined variables).
  • exception.class: The class name of the exception object that led to the problem. This is the object that propagated from a throw statement, and describes the kind of error.
  • exception.message: The actual message describing the event.
  • exception.file: This is the exact line of code where the error happened. It is the start of the stack trace, also known as the call site.
  • exception.trace: The full stack trace for the event.
  • url: HTTP URL of the server request that failed.
  • referrer: The previous or parent HTTP URL from a browser. If present and set to a URL with a WMF domain, it generally indicates that an error was experienced in a browser on one of our user-facing web pages (instead of via a bot or app).
  • message: Specific message for this event, including full details and identifiers.
  • normalized_message: Like “message” but with variable values replaced by placeholders. Useful for determining how often a particular category of message is occurring.

Helpful Filter Attributes

  • normalized_message: Prefer this attribute over the “message” field for filtering. The “message” field contains values unique to each request (such as the request ID), whereas normalized_message ignores values from fields such as “reqId” and “url”, so a single filter covers different requests that have the same error message.
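
To illustrate why normalized_message is the better grouping key, here is a minimal Python sketch that counts events per normalized message. The field names follow the attribute list above; the sample events, their values, and the placeholder syntax are invented for illustration and may not match what Logstash actually stores.

```python
from collections import Counter

# Hypothetical events: the placeholder syntax and values are assumptions.
SAMPLE_EVENTS = [
    {"message": "[aaa111] /wiki/A Bad title", "normalized_message": "[{reqId}] {url} Bad title"},
    {"message": "[bbb222] /wiki/B Bad title", "normalized_message": "[{reqId}] {url} Bad title"},
    {"message": "[ccc333] /wiki/C Timeout", "normalized_message": "[{reqId}] {url} Timeout"},
]


def count_by_normalized_message(events):
    """Return (normalized_message, count) pairs, most frequent first."""
    counts = Counter(event["normalized_message"] for event in events)
    return counts.most_common()


for text, count in count_by_normalized_message(SAMPLE_EVENTS):
    print(count, text)
```

Grouping the same events by “message” would yield three groups of one, because each message embeds its own request ID and URL.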

Other places to find logs

Run logspam-watch to get a filtered, real-time log of occurring errors, including the MW version, the number of occurrences, and the exception message.

Reporting An Error

You notice an error on the mediawiki-new-errors dashboard, signifying it does not yet have an associated Phabricator task. There are 2 steps to take:

  1. Create a Phabricator task.
  2. Create a Kibana filter.

Create a Phabricator Task

There are two ways to create a Phabricator task: manually, or automatically from an existing Kibana log event.

  1. Manual Creation
    1. Click New Task
    2. Click New Task a second time, and click Report Error Code
  2. Automatic Creation from Kibana Log
    1. Expand one of the log event rows for the error you want to report
    2. Click Phatility > Submit.
    3. Check that the necessary information was autofilled.
      1. exception.trace: the stack trace. If it was not autofilled, manually copy it from the Kibana log.
      2. reqId - unique request ID that the developer investigating the Phabricator task can use to find this Kibana log again (as well as to find any other errors that were emitted during the same request).
    4. Ensure no personally identifiable information (PII) is submitted with the task.
      1. Replace any of the following with wildcards or placeholders: user name, article name, IP address, exact query parameters, exact timestamps. Be careful not to expose that a certain user was viewing a certain article, or that a certain article was viewed at a certain time.
        1. Examples: change wikipedia.org/wiki/ABC?wprov=123 to wikipedia.org/wiki/Pagename, or change /w/api.php?format=json&action=query&list=contribs&lc=John to /w/api.php?action=query&list=contribs&…
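
The URL redaction in the examples above can be sketched in Python. This is only an illustration under stated assumptions: the function name redact_url and the SAFE_PARAMS allowlist are hypothetical, and a person should still review the result before submitting a task.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: query parameters considered safe to keep verbatim.
SAFE_PARAMS = {"action", "list"}


def redact_url(url):
    """Replace page titles with a placeholder and drop non-allowlisted parameters."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}" if parts.netloc else ""
    if "/wiki/" in parts.path:
        # Article URLs: hide the title and drop the whole query string.
        prefix = parts.path.split("/wiki/", 1)[0]
        return f"{origin}{prefix}/wiki/Pagename"
    params = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in params if key in SAFE_PARAMS]
    query = urlencode(kept)
    if len(kept) < len(params):
        # Mark that something was removed, as in the examples above.
        query = f"{query}&…" if query else "…"
    base = f"{origin}{parts.path}"
    return f"{base}?{query}" if query else base


print(redact_url("wikipedia.org/wiki/ABC?wprov=123"))
print(redact_url("/w/api.php?format=json&action=query&list=contribs&lc=John"))
```

An allowlist is used rather than a blocklist so that an unexpected parameter is dropped by default instead of leaking.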

Additional Information

Add Tags

Component
Look through the stack trace, or other log attributes, to try and find the appropriate component(s) to tag with the task. Oftentimes the stack trace will reveal clues about which component of MediaWiki core and/or which extension an error originated from. Try to narrow down the source of the error by identifying the non-generic code path. For example, for the MW API any code inside ApiBase or ApiMain could be considered generic.
Team
Use the list at mediawiki.org/Maintainers to find what team(s) own the component(s), and tag them as well. “wikimedia-production-error” is the default tag which adds it to the production errors Phabricator workboard.

Optional Steps

  1. If the task seems like it would be detrimental to production environments, you can mark this task as a train blocker. When in doubt, mark as train blocker and reach out to ReleaseEngineering to confirm.
    1. Click Edit Parent Task
    2. Add the train blocker task for the week and add any extra information you can. For POST requests to the API, the request parameters are not available via Kibana. To find them, SSH to mwlog1001 and grep /srv/mw-log/api.log for the reqId; the matching log lines contain the request parameters. (See https://wikitech.wikimedia.org/wiki/Production_access if you need access to production servers.)
      1. Try to find out how frequent the error is, and whether it is likely new this week or was also recorded in previous weeks. To do this, use the (unfiltered) mediawiki-errors dashboard, and use the query bar to enter something like “message:<part of the error>”. Then search through the last 30 days to see how often and since when it has been happening. Expand a few log event rows to confirm that they look like the same issue.
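
The "how often and since when" check in the last step can be sketched as follows, assuming you have copied the timestamps of matching events out of Kibana as ISO 8601 strings. The function name, the sample timestamps, and the seven-day threshold are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_occurrences(timestamps, now, new_within_days=7):
    """Return (first_seen, per_day_counts, looks_new) for ISO 8601 timestamps."""
    if not timestamps:
        raise ValueError("no matching events")
    times = sorted(datetime.fromisoformat(stamp) for stamp in timestamps)
    per_day = Counter(moment.date().isoformat() for moment in times)
    first_seen = times[0]
    # "New" here means first seen within the threshold (one week by default).
    looks_new = now - first_seen <= timedelta(days=new_within_days)
    return first_seen, dict(per_day), looks_new


# Hypothetical timestamps of events matching the query.
stamps = ["2022-04-27T10:00:00", "2022-04-26T09:00:00", "2022-04-27T11:30:00"]
first_seen, per_day, looks_new = summarize_occurrences(stamps, now=datetime(2022, 4, 28, 16, 45))
print(first_seen, per_day, looks_new)
```

Because logs are purged after 90 days, an error that first appears at the edge of the search window may be older than this check suggests.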


Create an OpenSearch filter

A new filter should be added to the mediawiki-new-errors dashboard so that we don’t report the same error multiple times and so that new errors stand out.

Decide what to filter. Expand a specific log event and look for the exception.file field (or normalized_message).

We need to decide whether to exclude similar errors by call site (exception.file, start of stack trace) or by error message (normalized_message). It is preferred to exclude by exception.file because these can’t accidentally filter unrelated errors. To create the exclusion filter:

  1. Expand one of the log event rows relating to the problem we just reported to Phabricator.
  2. Click the negative magnifying glass icon with the hover text Filter out Value. This will create a filter to exclude all log events that are from this line of code.
    1. As long as the entry for exception.file is not something generic like MWDebug.php or MWExceptionHandler.php, filter by that. Otherwise, fall back to filtering out by normalized_message instead.
  3. Click on the new filter’s pencil icon to edit.
  4. Set a Label
    1. Usually a combination of Phabricator task ID + very short summary. Example: T231084 - Bad $oldContent param.
  5. Click Save on the filter.
  6. Save the dashboard by clicking Edit in the gray bar at the top, then clicking Save.
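
The decision rule in the steps above (prefer the call site unless it is generic, then label the filter with the task ID) can be summarized in a short Python sketch. The function names and the flat event dictionary are hypothetical; the two generic file names come from the step above, and the file path in the usage example is invented.

```python
# Call sites treated as too generic to filter on; extend as needed.
GENERIC_FILES = ("MWDebug.php", "MWExceptionHandler.php")


def choose_filter(event):
    """Return the (field, value) pair to exclude for this log event."""
    call_site = event.get("exception.file", "")
    if call_site and not any(name in call_site for name in GENERIC_FILES):
        return "exception.file", call_site
    # Generic or missing call site: fall back to the normalized message.
    return "normalized_message", event["normalized_message"]


def filter_label(task_id, summary):
    """Build a label of the form 'T231084 - Bad $oldContent param'."""
    return f"{task_id} - {summary}"


event = {
    "exception.file": "/hypothetical/path/includes/Foo.php:123",
    "normalized_message": "Bad $oldContent param",
}
print(choose_filter(event))
print(filter_label("T231084", "Bad $oldContent param"))
```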

Filtering by query string

For trickier filters, it may be helpful to use full text queries. Click on Edit filter, then Edit as Query DSL. As an example, when trying to match an error like:

Wikimedia\Rdbms\DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction (db1181)
Function: MediaWiki\User\UserOptionsManager::saveOptionsInternal
Query: [query redacted]

You could use:

{
  "query": {
    "query_string": {
      "query": "UserOptionsManager saveOptionsInternal",
      "fields": [
        "exception.message"
      ]
    }
  }
}

Query strings support some wildcard notation; see the OpenSearch documentation for details.
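
If you write such filters often, generating the JSON avoids quoting and bracket mistakes. The following Python sketch builds a filter of the same shape as the example above; the helper name query_string_filter is hypothetical, and the output is meant to be pasted into the Edit as Query DSL box.

```python
import json


def query_string_filter(terms, field):
    """Build a query_string filter matching the given terms in one field."""
    return {
        "query": {
            "query_string": {
                "query": " ".join(terms),
                "fields": [field],
            }
        }
    }


dsl = query_string_filter(["UserOptionsManager", "saveOptionsInternal"], "exception.message")
print(json.dumps(dsl, indent=2))
```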

After Task Resolution

After a Phabricator task has been completed, be sure to take a few more steps:

  1. Open the mediawiki-new-errors dashboard afresh.
  2. Delete the filter for this T-number from mediawiki-new-errors by clicking the trashcan icon.
  3. Click Edit > Save to save these changes.
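
Because filter labels start with the Phabricator task ID, finding filters that are due for removal can be sketched in Python. This is an illustration only, not how the check-new-error-tasks scripts work: the function name, the sample labels other than the T231084 example, and the set of resolved tasks are all hypothetical.

```python
import re

# Matches a leading task ID such as "T231084" at the start of a filter label.
TASK_ID = re.compile(r"^(T\d+)\b")


def stale_filter_labels(labels, resolved_tasks):
    """Return the labels whose leading task ID is in resolved_tasks."""
    stale = []
    for label in labels:
        match = TASK_ID.match(label)
        if match and match.group(1) in resolved_tasks:
            stale.append(label)
    return stale


labels = [
    "T231084 - Bad $oldContent param",
    "T999999 - Hypothetical open task",
    "Unlabelled filter",
]
# Assumption for the example: T231084 has been resolved.
print(stale_filter_labels(labels, resolved_tasks={"T231084"}))
```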