SRE/Observability: Difference between revisions

LMata (talk | contribs)
added how we work landing page
Line 11: Line 11:
|-
|-
|}
|}
</div></div></div>
</div>


The starting point for observability resources at Wikimedia SRE.

=== Alerts ===
* [https://icinga.wikimedia.org/alerts icinga.w.o/alerts]: central monitoring and alerting platform. See also [[Icinga]].
* [https://upload.wikimedia.org/wikipedia/labs/0/0a/Alerting_Infrastructure_design_document_%26_roadmap.pdf Alerting infrastructure roadmap] PDF

=== Logs ===
* [https://logstash.wikimedia.org/app/kibana Kibana] (a.k.a. logstash): central logging platform. See also [[Logstash]].
* [https://upload.wikimedia.org/wikipedia/labs/5/58/Logging_infrastructure_design_document.pdf Logging infrastructure design document] PDF

=== Metrics ===
* [https://grafana.wikimedia.org/ grafana.w.o]: central observability platform. See also [[Grafana.wikimedia.org|Grafana]].
* [[Prometheus]], recommended and supported metrics toolkit
* [[Graphite]], supported but deprecated time series framework
* [[Statsd]], supported but deprecated metrics aggregation
* [[Observability/Dashboard_guidelines]], ideas towards better dashboards

= How we work. =

=== Overview ===
The Observability team (often referred to as o11y) maintains several tools while curating and building a collaborative roadmap for the Wikimedia Foundation. Maintaining several of these workstreams presents challenges, because adjacent work, related work, and work directly assigned to the Observability team workboard are not easily distinguishable from each other.

The purpose of this document is to describe:

* How the #o11y team does its work
* How other teams can request assistance or time on our roadmap
* How we will surface scheduling dependencies
* How we will use Phabricator to boost visibility
* What tools we intend to use

=== How to connect (and why) ===

* Reach out:
** IRC #wikimedia-sre-observability
** Email [mailto:sre-observability@wikimedia.org sre-observability@wikimedia.org]
** Phabricator tags:
*** sre-observability, observability, o11y, wikimedia-logstash, …
** VictorOps in case of emergencies

== How we Organize our Work ==
Work comes from multiple sources, but most requests should land in Phabricator. The #observability tag/project is the tag we use for all incoming work. Tasks will be in one of six major states: inbox, backlog, scheduled, in progress, radar, done/closed.

* '''Inbox''': this is the default state; all work lands in the #observability inbox.
* '''Backlog''': work is accepted but unscheduled (we don't know when it will be worked on, but we know it's a good thing to do).
* '''Scheduled''': work is scheduled (a.k.a. prioritized). We know what needs to be done, who will do it, and when it will be done. This is manifested as a FY-Q# tag (FY2021/2022 Q1 milestone tag or similar), plus an inbox for that tag or an Up Next for that milestone.
* '''In progress''': the work is actively being worked on.
* '''Radar''': work we want to stay informed about but that will not be on our workboard. This applies only to the #observability component tag and is kept for visibility purposes.
* '''Done''': a relative state; the task is placed in a done column under a specific milestone when it gets completed.

The Observability team grooms incoming tasks weekly, normally during planning meetings on Monday at 8:00 AM Pacific. Some requests may receive an out-of-band prioritization effort.

Grooming consists of reviewing the inbox of the #observability (component) workboard and of the #sre-observability (group) workboard (the inbox for a quarter, or Up Next).

From there, a task or request that is actionable for the Observability team goes through a quick prioritization: it is either marked "done this quarter" (i.e. time sensitive) or moved to the backlog. Otherwise the task goes to radar, or stays blocked in the backlog if it is unable to move forward. Tasks that lack enough information to be groomed will receive a follow-up comment and remain unprioritized in the general backlog until enough information has been collected to effectively perform the task.

== Phabricator Workflow ==

* Tasks go into the Observability intake project (#observability)
* Tasks are then groomed and tagged with the appropriate subcomponent (subproject) area, one of four:
** o11y-alerting
** o11y-metrics
** o11y-logging
** o11y-tracing (to be determined later)
* Each of these areas is a subproject and is prioritized accordingly, to manage a product roadmap for each of these main tracks of work
* Tasks then get tagged into a milestone for execution
* Tasks are then either scheduled as Up Next or scheduled to a specific quarter
* The process is then iterated on periodically (weekly, quarterly, yearly)

= How we Plan our Roadmap =
Roadmap planning follows a rolling roadmap of one year or more, with the goal of having a list of tasks pre-groomed and prioritized periodically (quarterly).

Six major work categories drive the efforts of the o11y team:

* Alerting
* Metrics
* Logs
* Tracing (future)
* Maintenance/Incidents
* Incident Management

The goal of this process is to:

* Drive each of these major workstreams
* Clearly set goals and deliverables
* Quantify effort and time investment per workstream, according to the level of investment that the organization is interested in for each initiative, and
* Allocate an adequate amount of time to each initiative.

== Prioritization ==
Prioritization is both a scheduled and a continual effort to size up the work and the importance/impact of specific workstreams. The team employs a simple forced-rank list of priorities that is fed from the intake process and groomed by the team. The list is then taken to a spreadsheet, where the projects are scored for an overall feel of value and capacity.

Order of precedence for prioritization:

* Tier 1 ('''high'''): incidents, security events, privacy concerns, PII in logs, unbreak-now events
* Tier 2 ('''medium'''): project work (OKR), outside requests, maintenance
* Tier 3 ('''low'''): non-critical maintenance
* Tier 4 ('''lowest'''): work to be done when all other prioritized work is completed

=== Project Work (OKR) ===
All project work is prioritized and groomed beforehand. Overarching project tasks are created in Phabricator with subtasks, both of which are tagged with a FY/Quarter "milestone" to indicate scheduling for projects that span multiple quarters or years.

=== Maintenance (non-OKR) ===
Planned maintenance follows the same workflow as regular project work. Unplanned maintenance or requests are groomed and prioritized based on urgency and severity.

=== Work Cadence Summary ===
{| class="wikitable"
!Activity
!Frequency
!Where
|-
|Intake Grooming + Prioritization
|Weekly
|o11y office hours
|-
|Planning (rolling roadmap)
|Quarterly
|OKR Meetings
|-
|Annual Planning
|Yearly
|TBD
|}
</div></div>


[[Category: Monitoring]]

Revision as of 15:39, 12 July 2021

SRE Observability

SRE Observability - Monitoring and Logging (Prometheus/Grafana and ElasticSearch, plus some Kafka).

The Observability team, or "o11y" for short, works across SRE and Technology to provide teams with tools, platforms and insights into how systems and services are performing. It leverages technologies such as Grafana, Kibana/Logstash, Prometheus, AlertManager and more.

How we work.

Overview

The Observability team (often referred to as o11y) maintains several tools while curating and building a collaborative roadmap for the Wikimedia Foundation. Maintaining several of these workstreams presents challenges, because adjacent work, related work, and work directly assigned to the Observability team workboard are not easily distinguishable from each other.

The purpose of this document is to describe:

  • How the #o11y team does its work
  • How other teams can request assistance or time on our roadmap
  • How we will surface scheduling dependencies
  • How we will use Phabricator to boost visibility
  • What tools we intend to use

How to connect (and why)

  • Reach out:
    • IRC #wikimedia-sre-observability
    • Email sre-observability@wikimedia.org
    • Phabricator tags:
      • sre-observability, observability, o11y, wikimedia-logstash, …
    • VictorOps in case of emergencies

How we Organize our Work

Work comes from multiple sources, but most requests should land in Phabricator. The #observability tag/project is the tag we use for all incoming work. Tasks will be in one of six major states: inbox, backlog, scheduled, in progress, radar, done/closed.

  • Inbox: this is the default state; all work lands in the #observability inbox.
  • Backlog: work is accepted but unscheduled (we don't know when it will be worked on, but we know it's a good thing to do).
  • Scheduled: work is scheduled (a.k.a. prioritized). We know what needs to be done, who will do it, and when it will be done. This is manifested as a FY-Q# tag (FY2021/2022 Q1 milestone tag or similar), plus an inbox for that tag or an Up Next for that milestone.
  • In progress: the work is actively being worked on.
  • Radar: work we want to stay informed about but that will not be on our workboard. This applies only to the #observability component tag and is kept for visibility purposes.
  • Done: a relative state; the task is placed in a done column under a specific milestone when it gets completed.

The Observability team grooms incoming tasks weekly, normally during planning meetings on Monday at 8:00 AM Pacific. Some requests may receive an out-of-band prioritization effort.

Grooming consists of reviewing the inbox of the #observability (component) workboard and of the #sre-observability (group) workboard (the inbox for a quarter, or Up Next).

From there, a task or request that is actionable for the Observability team goes through a quick prioritization: it is either marked "done this quarter" (i.e. time sensitive) or moved to the backlog. Otherwise the task goes to radar, or stays blocked in the backlog if it is unable to move forward. Tasks that lack enough information to be groomed will receive a follow-up comment and remain unprioritized in the general backlog until enough information has been collected to effectively perform the task.
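
The grooming decision described above can be sketched as a small function. This is only an illustration, not the team's tooling: the Task fields (has_enough_info, actionable, time_sensitive, blocked) are hypothetical names, and the order in which the checks run is one possible reading of the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """The six task states named in this document."""
    INBOX = "inbox"
    BACKLOG = "backlog"
    SCHEDULED = "scheduled"
    IN_PROGRESS = "in progress"
    RADAR = "radar"
    DONE = "done/closed"


@dataclass
class Task:
    # All fields are hypothetical; Phabricator does not expose them under these names.
    title: str
    has_enough_info: bool = True
    actionable: bool = True       # actionable for the Observability team
    time_sensitive: bool = False  # i.e. "done this quarter"
    blocked: bool = False


def groom(task: Task) -> tuple[State, str]:
    """Return the state a task leaves the inbox for, plus a short note."""
    if not task.has_enough_info:
        # Assumption: information is checked first, since grooming needs it.
        return State.BACKLOG, "follow-up comment posted; unprioritized"
    if not task.actionable:
        return State.RADAR, "kept for visibility only"
    if task.blocked:
        return State.BACKLOG, "blocked; unable to move forward"
    if task.time_sensitive:
        return State.SCHEDULED, "done this quarter"
    return State.BACKLOG, "accepted but unscheduled"


if __name__ == "__main__":
    for t in (Task("urgent fix", time_sensitive=True),
              Task("vague request", has_enough_info=False),
              Task("another team's migration", actionable=False)):
        state, note = groom(t)
        print(f"{t.title}: {state.value} ({note})")
```

Returning a note next to the state keeps the reason for a decision visible, which mirrors the follow-up comment mentioned above.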

Phabricator Workflow

  • Tasks go into the Observability intake project (#observability)
  • Tasks are then groomed and tagged with the appropriate subcomponent (subproject) area, one of four:
    • o11y-alerting
    • o11y-metrics
    • o11y-logging
    • o11y-tracing (to be determined later)
  • Each of these areas is a subproject and is prioritized accordingly, to manage a product roadmap for each of these main tracks of work
  • Tasks then get tagged into a milestone for execution
  • Tasks are then either scheduled as Up Next or scheduled to a specific quarter
  • The process is then iterated on periodically (weekly, quarterly, yearly)

How we Plan our Roadmap

Roadmap planning follows a rolling roadmap of one year or more, with the goal of having a list of tasks pre-groomed and prioritized periodically (quarterly).

Six major work categories drive the efforts of the o11y team:

  • Alerting
  • Metrics
  • Logs
  • Tracing (future)
  • Maintenance/Incidents
  • Incident Management

The goal of this process is to:

  • Drive each of these major workstreams
  • Clearly set goals and deliverables
  • Quantify effort and time investment per workstream, according to the level of investment that the organization is interested in for each initiative, and
  • Allocate an adequate amount of time to each initiative.

Prioritization

Prioritization is both a scheduled and a continual effort to size up the work and the importance/impact of specific workstreams. The team employs a simple forced-rank list of priorities that is fed from the intake process and groomed by the team. The list is then taken to a spreadsheet, where the projects are scored for an overall feel of value and capacity.

Order of precedence for prioritization:

  • Tier 1 (high): incidents, security events, privacy concerns, PII in logs, unbreak-now events
  • Tier 2 (medium): project work (OKR), outside requests, maintenance
  • Tier 3 (low): non-critical maintenance
  • Tier 4 (lowest): work to be done when all other prioritized work is completed
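
As a rough sketch of how a forced-rank list could follow from these tiers, the snippet below sorts a queue of (title, kind) pairs by tier. The kind labels, the TIER_BY_KIND mapping, and the choice to treat unrecognized kinds as Tier 4 are all assumptions made for illustration; the team's actual scoring happens in a spreadsheet and is not reproduced here.

```python
from enum import IntEnum
from typing import List, Tuple


class Tier(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3
    LOWEST = 4


# Hypothetical labels for the kinds of work listed under each tier.
TIER_BY_KIND = {
    "incident": Tier.HIGH,
    "security event": Tier.HIGH,
    "privacy concern": Tier.HIGH,
    "PII in logs": Tier.HIGH,
    "unbreak now": Tier.HIGH,
    "project work (OKR)": Tier.MEDIUM,
    "outside request": Tier.MEDIUM,
    "maintenance": Tier.MEDIUM,
    "non-critical maintenance": Tier.LOW,
}


def tier_of(kind: str) -> Tier:
    """Look up the tier for a kind of work; unknown kinds fall to the lowest tier (assumption)."""
    return TIER_BY_KIND.get(kind, Tier.LOWEST)


def forced_rank(tasks: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order (title, kind) pairs by tier. sorted() is stable, so ties keep intake order."""
    return sorted(tasks, key=lambda task: tier_of(task[1]))


if __name__ == "__main__":
    queue = [
        ("tidy old dashboards", "non-critical maintenance"),
        ("user data found in logs", "PII in logs"),
        ("new exporter", "project work (OKR)"),
        ("someday idea", "other"),
    ]
    for title, kind in forced_rank(queue):
        print(f"Tier {tier_of(kind).value}: {title}")
```

Because the sort is stable, two tasks in the same tier stay in the order they arrived, so any finer ranking within a tier is still left to the team.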

Project Work (OKR)

All project work is prioritized and groomed beforehand. Overarching project tasks are created in Phabricator with subtasks, both of which are tagged with a FY/Quarter "milestone" to indicate scheduling for projects that span multiple quarters or years.

Maintenance (non-OKR)

Planned maintenance follows the same workflow as regular project work. Unplanned maintenance or requests are groomed and prioritized based on urgency and severity.

Work Cadence Summary

Activity                            Frequency    Where
Intake Grooming + Prioritization    Weekly       o11y office hours
Planning (rolling roadmap)          Quarterly    OKR Meetings
Annual Planning                     Yearly       TBD