SRE/SRE Team requests: Difference between revisions

No edit summary
update "operations" to "SRE" since we've renamed the phab project; updated the URLs too for consistency even though both versions work
Line 12: Line 12:


== Phabricator ==
== Phabricator ==
The majority of operations requests should be filed within the [https://phabricator.wikimedia.org/ Wikimedia Phabricator] installation using the #operations project tag.
The majority of operations requests should be filed within the [https://phabricator.wikimedia.org/ Wikimedia Phabricator] installation using the #SRE project tag.
* If you keep the default priority to 'Needs Triage' and it is in the <code>#Operations</code> project, our [[SRE Clinic Duty]] assignee for the week will triage your request.
* If you keep the default priority to 'Needs Triage' and it is in the <code>#SRE</code> project, our [[SRE Clinic Duty]] assignee for the week will triage your request.
* This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=operations link] will create a task in the operations project.
* This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=sre link] will create a task in the SRE project.
If you further refine your request using the below instructions, it will usually result in faster triage.
If you further refine your request using the below instructions, it will usually result in faster triage.


Line 25: Line 25:
** Example: Volunteer transferring domain to WMF control.
** Example: Volunteer transferring domain to WMF control.
** Example: Incoming domains needing implementation/support on cluster.
** Example: Incoming domains needing implementation/support on cluster.
* This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=domains,operations link] will create a task in the domain & operations projects.
* This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=domains,sre link] will create a task in the Domains & SRE projects.
** It is advised that you leave your priority as 'Needs Triage' and not assign it to a specific person. This will result in it showing in the top of the operations triage lists.
** It is advised that you leave your priority as 'Needs Triage' and not assign it to a specific person. This will result in it showing in the top of the SRE triage lists.
* If you are requesting that Wikimedia register a domain that is currently unregistered, you will want to select option Security: Other confidential issue.
* If you are requesting that Wikimedia register a domain that is currently unregistered, you will want to select option Security: Other confidential issue.
** This allows you, plus the [https://phabricator.wikimedia.org/tag/wmf-nda/ wmf-nda] to view the task, but not the entire internet; requesting we register an unregistered domain in an open task is a nice way to let squatters know what to register.
** This allows you, plus the [https://phabricator.wikimedia.org/tag/wmf-nda/ wmf-nda] to view the task, but not the entire internet; requesting we register an unregistered domain in an open task is a nice way to let squatters know what to register.
Line 35: Line 35:
** Full testing of service & puppetization within [https://wikitech.wikimedia.org/wiki/Help:FAQ Wikimedia Cloud VPS]
** Full testing of service & puppetization within [https://wikitech.wikimedia.org/wiki/Help:FAQ Wikimedia Cloud VPS]
** Documentation of service and use on [https://wikitech.wikimedia.org Wikitech].
** Documentation of service and use on [https://wikitech.wikimedia.org Wikitech].
* You can click the link above to pre-populate a hardware request ticket with the basic fields for entry. Please also include the operations and hardware-requests projects. These include:
* You can click the link above to pre-populate a hardware request ticket with the basic fields for entry. Please also include the SRE and hardware-requests projects. These include:
** Cloud Project Tested, Site/Location, Number of systems, Service, Networking Requirements like access to specific networks, Processor Requirements, Memory, Disks:, NIC(s), Partitioning Scheme, and any other relevant notes/info.
** Cloud Project Tested, Site/Location, Number of systems, Service, Networking Requirements like access to specific networks, Processor Requirements, Memory, Disks:, NIC(s), Partitioning Scheme, and any other relevant notes/info.
* Note that SRE might suggest using a VM instead if we deem it applicable.
* Note that SRE might suggest using a VM instead if we deem it applicable.
Line 42: Line 42:


=== Virtual machine requests (Production) ===
=== Virtual machine requests (Production) ===
TL;DR. Click [https://phabricator.wikimedia.org/maniphest/task/create/?title=Site:%20(QUANTITY)%20VM%20%request%20for%20SERVICE%5bS%5d&projects=operations,vm-requests&description=Cloud%20VPS%20Project%20Tested%3A%20%3Cproject_name%3E%0ASite%2FLocation%3A%3CEQIAD%7CCODFW%3E%20%0ANumber%20of%20systems%3A%20%3C%23%20of%20VMs%3E%20%0AService%3A%20%3Cservice%20name%3E%0ANetworking%20Requirements%3A%20%3Cinternal%7Cexternal%20IP%3E%2C%20%3Cspecific%20networking%20access%20needed%3E%20%0AProcessor%20Requirements%3A%20%3CNumber%20of%20Virtual%20CPUS%3E%0AMemory%3A%20%0ADisks%3A%20%3CCapacity%20only%3E%0AOther%20Requirements%3A%20%0A VM Requests] and fill in the form. But please do read the following.
TL;DR. Click [https://phabricator.wikimedia.org/maniphest/task/create/?title=Site:%20(QUANTITY)%20VM%20%request%20for%20SERVICE%5bS%5d&projects=sre,vm-requests&description=Cloud%20VPS%20Project%20Tested%3A%20%3Cproject_name%3E%0ASite%2FLocation%3A%3CEQIAD%7CCODFW%3E%20%0ANumber%20of%20systems%3A%20%3C%23%20of%20VMs%3E%20%0AService%3A%20%3Cservice%20name%3E%0ANetworking%20Requirements%3A%20%3Cinternal%7Cexternal%20IP%3E%2C%20%3Cspecific%20networking%20access%20needed%3E%20%0AProcessor%20Requirements%3A%20%3CNumber%20of%20Virtual%20CPUS%3E%0AMemory%3A%20%0ADisks%3A%20%3CCapacity%20only%3E%0AOther%20Requirements%3A%20%0A VM Requests] and fill in the form. But please do read the following.
* THIS IS NOT [[Portal:Toolforge|TOOLFORGE]].
* THIS IS NOT [[Portal:Toolforge|TOOLFORGE]].
* This is for requesting a virtual machine in the production cluster. (This is usually as an alternative to a bare metal server.)
* This is for requesting a virtual machine in the production cluster. (This is usually as an alternative to a bare metal server.)
Line 60: Line 60:
=== Mail aliases ===
=== Mail aliases ===
* Please note that mail aliases are not handled by SRE anymore. Mail aliases under the wikimedia.org domain are handled by the [https://office.wikimedia.org/wiki/Office_IT OIT team]. Please send a mail to techsupport@wikimedia.org to request one. Please note that if you are not staff, and require a mail alias, you should request it via your working group/team leads/technical mentor/staff.
* Please note that mail aliases are not handled by SRE anymore. Mail aliases under the wikimedia.org domain are handled by the [https://office.wikimedia.org/wiki/Office_IT OIT team]. Please send a mail to techsupport@wikimedia.org to request one. Please note that if you are not staff, and require a mail alias, you should request it via your working group/team leads/technical mentor/staff.
** Only if you need an alias in another domain besides wikimedia.org or have a specific reason that you need it to trigger before Google routing, create a [https://phabricator.wikimedia.org/maniphest/task/create/?projects=operations SRE request] in the Operations project.
** Only if you need an alias in another domain besides wikimedia.org or have a specific reason that you need it to trigger before Google routing, create an [https://phabricator.wikimedia.org/maniphest/task/create/?projects=sre SRE request] in the SRE project.
** If you have an existing exim mail aliases handled by SRE you are encouraged to move it by requesting the same from OIT and telling SRE to delete the existing one on their side. This would be part of [https://phabricator.wikimedia.org/T122144 T122144]. Thanks!
** If you have an existing exim mail aliases handled by SRE you are encouraged to move it by requesting the same from OIT and telling SRE to delete the existing one on their side. This would be part of [https://phabricator.wikimedia.org/T122144 T122144]. Thanks!


Line 81: Line 81:
** SRE involvement is typically only required when a list administrator is not listed on the list information page, or if the administrator has become unavailable for the role.
** SRE involvement is typically only required when a list administrator is not listed on the list information page, or if the administrator has become unavailable for the role.
** We will NOT simply change list owners; all attempts to handle the request via the usual means/admins must be exhausted. We will attempt to also contact the list administrator before we change anything.
** We will NOT simply change list owners; all attempts to handle the request via the usual means/admins must be exhausted. We will attempt to also contact the list administrator before we change anything.
** If you still want SRE assistance, please file a task with both the #operations & #Wikimedia-Mailing-lists projects.
** If you still want SRE assistance, please file a task with both the #sre & #Wikimedia-Mailing-lists projects.
*** This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=Wikimedia-Mailing-lists%20operations link] will create a task in both operations & Wikimedia-Mailing-lists projects.
*** This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=Wikimedia-Mailing-lists%20sre link] will create a task in both operations & Wikimedia-Mailing-lists projects.


=== Patch review ===
=== Patch review ===
* Any patches that require an SRE team member review should have a [https://phabricator.wikimedia.org/ Phabricator] task and have both the operations and Patch-For-Review project tags assigned to it.
* Any patches that require an SRE team member review should have a [https://phabricator.wikimedia.org/ Phabricator] task and have both the SRE and Patch-For-Review project tags assigned to it.
** Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our [[SRE Clinic Duty]] assignee will attempt to triage to the appropriate parties.
** Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our [[SRE Clinic Duty]] assignee will attempt to triage to the appropriate parties.
** This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=Patch-For-Review%20operations link] will create a task in the Patch-For-Review project.
** This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=Patch-For-Review%20sre link] will create a task in the Patch-For-Review project.
* The [[Puppet request window]] takes place twice weekly. Simple patches can be included during this window, please see the page for further details.
* The [[Puppet request window]] takes place twice weekly. Simple patches can be included during this window, please see the page for further details.


=== Network configuration ===
=== Network configuration ===
* Network requests (router configuration, switch port descriptions, vlan assignments, etc) should have a [https://phabricator.wikimedia.org/ Phabricator] task and have both the operations and network project tags assigned to it.
* Network requests (router configuration, switch port descriptions, vlan assignments, etc) should have a [https://phabricator.wikimedia.org/ Phabricator] task and have both the SRE and network project tags assigned to it.
** Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our [[SRE Clinic Duty]] assignee will attempt to triage to the appropriate parties.
** Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our [[SRE Clinic Duty]] assignee will attempt to triage to the appropriate parties.
** This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=network%20operations link] will create a task with the operations and network projects associated with it.
** This [https://phabricator.wikimedia.org/maniphest/task/create/?projects=network%20sre link] will create a task with the SRE and network projects associated with it.
* Subnets/VLANS are listed on the switches (not public accessible) and in our [https://phabricator.wikimedia.org/diffusion/ODNS/ operations/dns] git repo (public accessible).
* Subnets/VLANS are listed on the switches (not public accessible) and in our [https://phabricator.wikimedia.org/diffusion/ODNS/ operations/dns] git repo (public accessible).



Revision as of 18:08, 5 May 2021

Here's what you can do if you need help from the Wikimedia Site Reliability Engineering team or one of its sub-teams: Datacenter Operations, Data Persistence, Infrastructure Foundations, Observability, Service Operations and Traffic. SRE used to be called Ops or TechOps, and that naming convention is still used in a number of places, for example in the IRC channel names.

Urgent issues

Urgent issues are generally imminent risks to site security, like compromised SSH keys.

The preferred way to contact SRE in such an emergency is to use our Klaxon app. Klaxon also shows open alerts, so you can check whether somebody else has already alerted SRE.

If you are unable to access or use Klaxon, you may use one of the alternatives below. In that case, keep trying until you get confirmation that a member of the SRE team has received the message.

  • #wikimedia-operations IRC channel (in an emergency, consider using the hotword #page to get attention)
  • Phone (Foundation staff members have access to the contact list on Office wiki)

Phabricator

The majority of operations requests should be filed within the Wikimedia Phabricator installation using the #SRE project tag.

  • If you keep the default priority of 'Needs Triage' and the task is in the #SRE project, our SRE Clinic Duty assignee for the week will triage your request.
  • This link will create a task in the SRE project.

If you further refine your request using the instructions below, it will usually result in faster triage.

Access requests

Domain requests

  • The #domains project is for all domain registration requests, nameserver updates, and anything involving a domain registrar.
    • Example: Volunteer transferring domain to WMF control.
    • Example: Incoming domains needing implementation/support on cluster.
  • This link will create a task in the Domains & SRE projects.
    • It is advised that you leave the priority as 'Needs Triage' and not assign the task to a specific person. This will result in it appearing at the top of the SRE triage lists.
  • If you are requesting that Wikimedia register a domain that is currently unregistered, select the option 'Security: Other confidential issue'.
    • This allows you, plus the wmf-nda group, to view the task, but not the entire internet; requesting that we register an unregistered domain in an open task is a nice way to let squatters know what to register.
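
The task-creation links on this page pre-select project tags through the `projects` query parameter, and the links use either a comma or an encoded space (`%20`) between tags. As a rough sketch, a link can be checked before it is shared to confirm which projects it names. The helper name `projects_in_link` is made up for illustration and is not part of Phabricator.

```python
import re
from urllib.parse import parse_qs, urlsplit


def projects_in_link(url):
    """Return the project tags named in a task-creation link.

    Hypothetical helper: it only inspects the URL text and does not
    contact Phabricator or check that the projects exist.
    """
    values = parse_qs(urlsplit(url).query).get("projects", [])
    tags = []
    for value in values:
        # parse_qs has already decoded %20 to a space, so split on
        # commas and whitespace to cover both separator styles.
        tags.extend(tag for tag in re.split(r"[,\s]+", value) if tag)
    return tags


domains_link = "https://phabricator.wikimedia.org/maniphest/task/create/?projects=domains,sre"
print(projects_in_link(domains_link))  # ['domains', 'sre']
```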

Hardware requests

  • TL;DR Click Hardware requests to file a task for requesting hardware. But please read the following.
  • Requesting a server for your service should only occur after the following:
  • You can click the link above to pre-populate a hardware request ticket with the basic fields for entry; please also add the SRE and hardware-requests projects. The basic fields include:
    • Cloud Project Tested, Site/Location, Number of systems, Service, Networking Requirements like access to specific networks, Processor Requirements, Memory, Disks, NIC(s), Partitioning Scheme, and any other relevant notes/info.
  • Note that SRE might suggest using a VM instead if we deem it applicable.
  • Do not place server requests in our procurement project.
    • A single #hardware-requests task can generate multiple #procurement sub-tasks, as each sub-task could be pricing from a specific vendor.

Virtual machine requests (Production)

TL;DR. Click VM Requests and fill in the form. But please do read the following.

  • THIS IS NOT TOOLFORGE.
  • This is for requesting a virtual machine in the production cluster. (This is usually as an alternative to a bare metal server.)
  • VMs are great for hardware sharing, increasing hardware usage. If your service does not have specific hardware requirements, a VM is an ideal candidate for it. But if it has critical performance requirements, it might very well not be.
  • Requesting a server for your service should only occur after the following:
  • You can click the link above to pre-populate a vm-request ticket with the basic fields for entry. These include:
    • Cloud VPS Project Tested, Site/Location, Service, Networking Requirements, Processor Requirements, Memory, Disks, and any other relevant notes/info.
  • Do note that SRE might suggest using bare metal hardware instead if we deem it necessary.
  • Networking-wise, using multiple NICs to increase throughput is not viable in a VM.
  • Disk performance is limited by the underlying technology and resource sharing.
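
The VM Requests link works by passing `title`, `projects` and `description` as URL-encoded query parameters to the Maniphest task-creation page. The following is a minimal sketch of how such a pre-populated link could be assembled. The function name `vm_request_url` and the sample values are assumptions for illustration; the field labels follow the description template used in the link.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://phabricator.wikimedia.org/maniphest/task/create/"

# Field labels taken from the description template in the VM Requests link.
VM_FIELDS = [
    "Cloud VPS Project Tested",
    "Site/Location",
    "Number of systems",
    "Service",
    "Networking Requirements",
    "Processor Requirements",
    "Memory",
    "Disks",
    "Other Requirements",
]


def vm_request_url(site, quantity, service, answers):
    """Build a task-creation link with title, projects and description filled in.

    Hypothetical helper; `answers` maps a field label to the text to show
    after it, and missing fields are left blank for the requester.
    """
    title = f"{site}: ({quantity}) VM request for {service}"
    description = "\n".join(
        f"{field}: {answers.get(field, '')}" for field in VM_FIELDS
    )
    params = {
        "title": title,
        "projects": "sre,vm-requests",
        "description": description,
    }
    # quote_via=quote encodes spaces as %20 and newlines as %0A,
    # matching the style of the existing link.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


example_url = vm_request_url(
    "eqiad", 2, "example-service",
    {"Service": "example-service", "Memory": "4 GB"},
)
print(example_url)
```

Letting `urlencode` do the escaping avoids hand-written percent sequences, which are easy to get wrong in a long template.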

Other Purchases: SSL Certificates, Support Contracts

  • All other requests for SRE purchasing of support contracts, SSL certificates, and other related items should be placed in the Procurement project.

Mail aliases

  • Please note that mail aliases are not handled by SRE anymore. Mail aliases under the wikimedia.org domain are handled by the OIT team. Please send a mail to techsupport@wikimedia.org to request one. Please note that if you are not staff, and require a mail alias, you should request it via your working group/team leads/technical mentor/staff.
    • Only if you need an alias in another domain besides wikimedia.org or have a specific reason that you need it to trigger before Google routing, create an SRE request in the SRE project.
    • If you have an existing exim mail alias handled by SRE, you are encouraged to move it by requesting the same alias from OIT and asking SRE to delete the existing one on their side. This would be part of T122144. Thanks!

Mailing lists

Creation requests

  • Please also see https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list
  • The SRE team doesn't create all mailing lists. Instead, you should file a general request under the Wikimedia-Mailing-lists project in Phabricator; please leave the priority as 'Needs Triage' so that our SRE Clinic Duty assignee can better notice it.
  • Please include the following:
    • requested name of the mailing list, ending in @lists.wikimedia.org
    • reasoning/explanation of purpose (and link to community consensus, if applicable)
    • initial list administrator's email address
    • secondary list administrator's email address (as a backup)
    • description of the list for the list info page (include this even for a private list, so SRE and mailman admins know why it exists)
    • Note if this should be public or private, and if archives should exist or not. (If list is private, archives should be private.)
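
The checklist above can be expressed as a small pre-submission check. This is only an illustrative sketch: the `ListRequest` class, its field names and the `problems` function are hypothetical and do not correspond to any Wikimedia or Mailman tooling.

```python
from dataclasses import dataclass

LIST_DOMAIN = "@lists.wikimedia.org"


@dataclass
class ListRequest:
    """Hypothetical container for the details a creation request should carry."""
    name: str             # requested list address
    purpose: str          # reasoning, plus a consensus link if applicable
    primary_admin: str    # initial list administrator's email address
    secondary_admin: str  # backup administrator's email address
    description: str      # text for the list info page
    private: bool         # True for a private list
    archives: str         # "none", "public" or "private"


def problems(req):
    """Return a list of human-readable issues; an empty list means none found."""
    issues = []
    if not req.name.endswith(LIST_DOMAIN):
        issues.append(f"list name must end in {LIST_DOMAIN}")
    required = [
        ("purpose", req.purpose),
        ("initial administrator", req.primary_admin),
        ("secondary administrator", req.secondary_admin),
        ("description", req.description),
    ]
    for label, value in required:
        if not value.strip():
            issues.append(f"missing {label}")
    if req.archives not in ("none", "public", "private"):
        issues.append("archives must be 'none', 'public' or 'private'")
    # If the list is private, its archives should be private as well.
    if req.private and req.archives == "public":
        issues.append("a private list should not have public archives")
    return issues


request = ListRequest(
    name="example-l@lists.wikimedia.org",
    purpose="Coordinate work on an example project",
    primary_admin="first.admin@example.org",
    secondary_admin="second.admin@example.org",
    description="Discussion list for the example project",
    private=True,
    archives="private",
)
print(problems(request))  # []
```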

Administration

  • General list administration is handled by each individual list's administrators, who are named on the list's information page.
    • SRE involvement is typically only required when a list administrator is not listed on the list information page, or if the administrator has become unavailable for the role.
    • We will NOT simply change list owners; all attempts to handle the request via the usual means/admins must be exhausted. We will attempt to also contact the list administrator before we change anything.
    • If you still want SRE assistance, please file a task with both the #sre & #Wikimedia-Mailing-lists projects.
      • This link will create a task in both the SRE & Wikimedia-Mailing-lists projects.

Patch review

  • Any patches that require an SRE team member review should have a Phabricator task and have both the SRE and Patch-For-Review project tags assigned to it.
    • Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our SRE Clinic Duty assignee will attempt to triage to the appropriate parties.
    • This link will create a task in the Patch-For-Review and SRE projects.
  • The Puppet request window takes place twice weekly. Simple patches can be included during this window; please see the page for further details.

Network configuration

  • Network requests (router configuration, switch port descriptions, VLAN assignments, etc.) should have a Phabricator task and have both the SRE and network project tags assigned to it.
    • Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our SRE Clinic Duty assignee will attempt to triage to the appropriate parties.
    • This link will create a task with the SRE and network projects associated with it.
  • Subnets/VLANs are listed on the switches (not publicly accessible) and in our operations/dns git repo (publicly accessible).

Schema changes

  • Schema changes on production databases have to be approved and applied by DBAs. Instructions on how to request a schema change are on the Schema change page.
    • Please do not assign a specific team member for review (though CCing them if you are uncertain is valid); our SRE Clinic Duty assignee will attempt to triage to the appropriate parties.
    • This link will create a task in the #Blocked-on-schema-change and #DBA projects.
    • Only use #Blocked-on-schema-change when the change is final, not while it is still in progress or has not been reviewed.
  • Normal schema changes can take up to 2 weeks to take effect. Those involving key tables such as revision, page or image may take longer.

IRC

  • SRE team members idle in #wikimedia-operations on Freenode.
  • This is generally useful for vague questions or project planning, but non-ideal for hardware requests, access requests, or ongoing work.
    • If the request will result in work on the part of the SRE team member, a Phabricator task will be requested to track the work.
    • There is an SRE Clinic Duty assignee from the SRE team for every week.
    • The clinic duty person is listed in the topic for #wikimedia-operations, as well as on SRE Clinic Duty. This changes every Monday.
    • The clinic duty person can be pinged, and is the first point of contact in IRC for operations issues.
      • Please note that our SRE team works in multiple time zones, and the clinic assignee for any given week will likely be working within their own local time zone.

SRE mailing list

SRE team members are subscribed to ops@lists.wikimedia.org.