SRE/SRE Team requests: Difference between revisions

From Wikitech

Revision as of 23:06, 11 March 2021

Here's what you can do if you need help from the Wikimedia Site Reliability Engineering (SRE) team or one of its sub-teams: Datacenter Operations, Data Persistence, Infrastructure Foundations, Observability, Service Operations, and Traffic.

Urgent issues

Urgent issues are generally imminent risks to site security, like compromised SSH keys.

The preferred way to contact SRE in such an emergency is to use our Klaxon app. Klaxon also shows open alerts, so you can see whether somebody else has already alerted SRE.

If you are unable to access or use Klaxon, you may choose one of the alternatives below. In that case, keep trying until you get confirmation that a member of the SRE team has received the message.

  • #wikimedia-operations IRC channel (in an emergency, consider using the hotword #page to get attention)
  • Phone (Foundation staff members have access to the contact list on Office wiki)

Phabricator

The majority of operations requests should be filed within the Wikimedia Phabricator installation using the #operations project tag.

  • If you keep the default priority of 'Needs Triage' and the task is in the #operations project, our SRE Clinic Duty assignee for the week will triage your request.
  • This link will create a task in the operations project.

If you further refine your request using the instructions below, triage will usually be faster.

Access requests

Domain requests

  • The #domains project is for all domain registration requests, nameserver updates, and anything involving a domain registrar.
    • Example: Volunteer transferring domain to WMF control.
    • Example: Incoming domains needing implementation/support on cluster.
  • This link will create a task in the domain & operations projects.
    • We advise leaving the priority as 'Needs Triage' and not assigning the task to a specific person. This makes it appear at the top of the operations triage lists.
  • If you are requesting that Wikimedia register a domain that is currently unregistered, select the option Security: Other confidential issue.
    • This allows you and wmf-nda to view the task, but not the entire internet; requesting that we register an unregistered domain in an open task is a nice way to let squatters know what to register.

Hardware requests

  • TL;DR Click Hardware requests to file a task for requesting hardware. But please read the following.
  • Requesting a server for your service should only occur after the following:
  • You can click the link above to pre-populate a hardware request ticket with the basic fields for entry; please also include the operations and hardware-requests projects. The fields include:
    • Cloud Project Tested, Site/Location, Number of systems, Service, Networking Requirements (such as access to specific networks), Processor Requirements, Memory, Disks, NIC(s), Partitioning Scheme, and any other relevant notes/info.
  • Note that SRE might suggest using a VM instead if we deem it applicable.
  • Do not place server requests in our procurement project.
    • A single #hardware-request can generate multiple #procurement sub-tasks, as each sub-task could be pricing from a specific vendor.

Virtual machine requests (Production)

TL;DR. Click VM Requests and fill in the form. But please do read the following.

  • THIS IS NOT TOOLFORGE.
  • This is for requesting a virtual machine in the production cluster. (This is usually as an alternative to a bare metal server.)
  • VMs are great for hardware sharing, increasing hardware usage. If your service does not have specific hardware requirements, a VM is an ideal candidate for it. But if it has critical performance requirements, it might very well not be.
  • Requesting a server for your service should only occur after the following:
  • You can click the link above to pre-populate a vm-request ticket with the basic fields for entry. These include:
    • Cloud VPS Project Tested, Site/Location, Service, Networking Requirements, Processor Requirements, Memory, Disks, and any other relevant notes/info.
  • Do note that SRE might suggest using bare metal hardware instead if we deem it necessary.
  • Networking-wise, using multiple NICs to increase throughput is not viable in a VM.
  • Disk performance is limited by the underlying technology and resource sharing.

Other Purchases: SSL Certificates, Support Contracts

  • All other requests for SRE purchasing of support contracts, SSL certificates, and other related items should be placed in the Procurement project.

Mail aliases

  • Please note that mail aliases are not handled by SRE anymore. Mail aliases under the wikimedia.org domain are handled by the OIT team. Please send a mail to techsupport@wikimedia.org to request one. Please note that if you are not staff, and require a mail alias, you should request it via your working group/team leads/technical mentor/staff.
    • Only if you need an alias in a domain other than wikimedia.org, or have a specific reason that you need it to trigger before Google routing, create an SRE request in the Operations project.
    • If you have existing Exim mail aliases handled by SRE, you are encouraged to move them by requesting the same aliases from OIT and telling SRE to delete the existing ones on their side. This would be part of T122144. Thanks!

Mailing lists

Creation requests

  • Please also see https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list
  • The SRE team doesn't create all mailing lists. Instead, you should file a general request under the Wikimedia-Mailing-lists project in Phabricator; please leave the priority as 'Needs Triage' so that our SRE Clinic Duty assignee notices it more easily.
  • Please include the following:
    • requested name of the mailing list, ending in @lists.wikimedia.org
    • reasoning/explanation of purpose (and link to community consensus, if applicable)
    • initial list administrator's email address
    • secondary list administrator's email address (as a backup)
    • description of the list for the list info page (include this even if the list is private, so SRE and Mailman admins know why it exists)
    • Note if this should be public or private, and if archives should exist or not. (If list is private, archives should be private.)
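Before filing, the checklist above can be sanity-checked with a short script. This is only a sketch: the `MailingListRequest` class, its field names, and the `archives` values are hypothetical and are not an official schema or tool; the checks simply mirror the bullet points above.

```python
from dataclasses import dataclass
from typing import List

LIST_DOMAIN = "@lists.wikimedia.org"


@dataclass
class MailingListRequest:
    """Hypothetical container for the requested fields (not an official schema)."""
    name: str                # requested list name, ending in @lists.wikimedia.org
    purpose: str             # reasoning / explanation of purpose
    admin_email: str         # initial list administrator
    backup_admin_email: str  # secondary list administrator
    description: str         # text for the list info page
    private: bool            # True for a private list
    archives: str            # assumed values: "none", "private", "public"

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means nothing is missing."""
        issues = []
        if not self.name.endswith(LIST_DOMAIN):
            issues.append(f"list name should end in {LIST_DOMAIN}")
        required = {
            "purpose": self.purpose,
            "initial administrator": self.admin_email,
            "secondary administrator": self.backup_admin_email,
            "description": self.description,
        }
        for label, value in required.items():
            if not value.strip():
                issues.append(f"{label} is missing")
        if self.archives not in ("none", "private", "public"):
            issues.append("archives must be none, private, or public")
        # If the list is private, its archives should be private as well.
        if self.private and self.archives == "public":
            issues.append("a private list should have private archives")
        return issues
```

Calling `problems()` on a draft request returns the items still to be fixed; the task itself is still filed by hand in Phabricator.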

Administration

  • General list administration is handled by each list's administrators; administrators can be viewed on the list's information page.
    • SRE involvement is typically only required when a list administrator is not listed on the list information page, or if the administrator has become unavailable for the role.
    • We will NOT simply change list owners; all attempts to handle the request via the usual means/admins must be exhausted. We will attempt to also contact the list administrator before we change anything.
    • If you still want SRE assistance, please file a task with both the #operations & #Wikimedia-Mailing-lists projects.
      • This link will create a task in both operations & Wikimedia-Mailing-lists projects.

Patch review

  • Any patches that require an SRE team member review should have a Phabricator task and have both the operations and Patch-For-Review project tags assigned to it.
    • Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our SRE Clinic Duty assignee will attempt to triage to the appropriate parties.
    • This link will create a task in the Patch-For-Review project.
  • The Puppet request window takes place twice weekly. Simple patches can be included during this window; please see the page for further details.

Network configuration

  • Network requests (router configuration, switch port descriptions, VLAN assignments, etc.) should have a Phabricator task and have both the operations and network project tags assigned to it.
    • Please do not assign a specific team member for review unless they are the subject matter expert (though CCing them if you are uncertain is valid); otherwise our SRE Clinic Duty assignee will attempt to triage to the appropriate parties.
    • This link will create a task with the operations and network projects associated with it.
  • Subnets/VLANs are listed on the switches (not publicly accessible) and in our operations/dns git repo (publicly accessible).

Schema changes

  • Schema changes on production databases have to be approved and applied by DBAs. Instructions on how to request their application are on the Schema change page.
    • Please do not assign a specific team member for review (though CCing them if you are uncertain is valid); our SRE Clinic Duty assignee will attempt to triage to the appropriate parties.
    • This link (https://phabricator.wikimedia.org/maniphest/task/create/?projects=DBA%20Blocked-on-schema-change) will create a task in the #Blocked-on-schema-change and #DBA projects.
    • Only use #Blocked-on-schema-change when the change is final, not while it is in progress or hasn't been reviewed.
  • Normal schema changes can take up to 2 weeks to take effect. Those involving key tables such as revision, page or image may take longer.
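The pre-filled task link for schema changes has the form https://phabricator.wikimedia.org/maniphest/task/create/?projects=DBA%20Blocked-on-schema-change, that is, the project names joined by a percent-encoded space in the `projects` query parameter. A minimal sketch of building such a link follows. The helper name `task_create_url` is hypothetical, and it is an assumption that the same pattern works for other project combinations on this page.

```python
from urllib.parse import quote

BASE = "https://phabricator.wikimedia.org/maniphest/task/create/"


def task_create_url(projects):
    """Build a pre-filled task-creation link (hypothetical helper).

    Project names are joined with a single space, which quote()
    percent-encodes as %20, matching the schema-change link.
    """
    joined = " ".join(projects)
    return f"{BASE}?projects={quote(joined, safe='')}"


print(task_create_url(["DBA", "Blocked-on-schema-change"]))
```

The priority and assignee are left untouched, so the resulting task keeps the default 'Needs Triage' state that the clinic duty assignee looks for.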

IRC

  • SRE team members idle in #wikimedia-operations on Freenode.
  • This is generally useful for vague questions or project planning, but not ideal for hardware requests, access requests, or ongoing work.
    • If the request will result in work on the part of the SRE team member, a Phabricator task will be requested to track the work.
    • There is an SRE Clinic Duty assignee from the SRE team for every week.
    • The clinic duty person is listed in the topic for #wikimedia-operations, as well as on SRE Clinic Duty. This changes every Monday.
    • The clinic duty person can be pinged, and is the first point of contact in IRC for operations issues.
      • Please note that our SRE team works in multiple time zones, and the clinic assignee for any given week will likely be working within their own local time zone.

Ops mailing list

SRE team members are subscribed to ops@lists.wikimedia.org.