SRE/SRE Clinic Duty
For getting assistance from the SRE team, see SRE/SRE Team requests.
SRE Clinic Duty is a triage duty established to ensure that tickets (and thus requests and projects) are triaged and processed in a timely fashion, with feedback and regular updates provided to SRE-supported projects and responsibilities.
This is a duty that is fulfilled by a member of the Wikimedia SRE team.
Roster
Schedule
Week starting | Clinician/backup | Team
2021-01-04 | Giuseppe Lavagetto | SRE-Service Operations
2021-01-11 | Arzhel Younsi | SRE-Infrastructure Foundations
2021-01-18 | Jaime Crespo | SRE-Data Persistence
2021-01-25 | Kunal Mehta | SRE-Service Operations
2021-02-01 | Chris Danis | SRE-Infrastructure Foundations
2021-02-08 | Valentín Gutierrez | SRE-Traffic
2021-02-15 | Moritz Mühlenhoff | SRE-Infrastructure Foundations
2021-02-22 | John Bond | SRE-Infrastructure Foundations
2021-03-01 | Janis Meybohm | SRE-Service Operations
2021-03-08 | Cas Rusnov | SRE-Infrastructure Foundations
2021-03-15 | Riccardo Coccioli | SRE-Infrastructure Foundations
2021-03-22 | Stevie Beth Mhaol | SRE-Data Persistence
2021-03-29 | Effie Mouzeli | SRE-Service Operations
2021-04-05 | Emanuele Rocca | SRE-Traffic
2021-04-12 | Filippo Giunchedi | SRE-Observability
2021-04-19 | Alexandros Kosiaris | SRE-Service Operations
2021-04-26 | Rob Halsell | SRE-Data Center Operations
2021-05-03 | Daniel Zahn | SRE-Service Operations
2021-05-10 | Arzhel Younsi | SRE-Infrastructure Foundations
2021-05-17 | Brandon Black | SRE-Traffic
2021-05-24 | Manuel Aróstegui | SRE-Data Persistence
2021-05-31 | Cole White | SRE-Observability
2021-06-07 | Riccardo Coccioli | SRE-Infrastructure Foundations
2021-06-14 | Sukhbir Singh | SRE-Traffic
2021-06-21 | John Bond | SRE-Infrastructure Foundations
2021-06-28 | Keith Herron | SRE-Observability
2021-07-05 | No clinic duty this week |
2021-07-12 | Valentín Gutierrez | SRE-Traffic
2021-07-19 | Reuven Lazarus | SRE-Service Operations
2021-07-26 | Kunal Mehta | SRE-Service Operations
2021-08-02 | Moritz Mühlenhoff | SRE-Infrastructure Foundations
2021-08-09 | Emanuele Rocca | SRE-Traffic
2021-08-16 | Rob Halsell | SRE-Data Center Operations
2021-08-23 | Jaime Crespo | SRE-Data Persistence
2021-08-30 | Filippo Giunchedi | SRE-Observability
2021-09-06 | Alexandros Kosiaris | SRE-Service Operations
2021-09-13 | Cathal Mooney/Arzhel Younsi | SRE-Infrastructure Foundations
2021-09-20 | Manuel Aróstegui | SRE-Data Persistence
2021-09-27 | Giuseppe Lavagetto | SRE-Service Operations
2021-10-04 | Stevie Shirley | SRE-Data Persistence
2021-10-11 | Chris Danis | SRE-Infrastructure Foundations
2021-10-18 | Daniel Zahn | SRE-Service Operations
2021-10-25 | Sukhbir Singh | SRE-Traffic
2021-11-01 | Effie Mouzeli | SRE-Service Operations
2021-11-08 | Cole White | SRE-Observability
2021-11-15 | JW/Janis Meybohm | SRE-Service Operations
2021-11-22 | Marc Mandere/Brandon Black | SRE-Traffic
2021-11-29 | Keith Herron | SRE-Observability
2021-12-06 | Valentín Gutierrez | SRE-Traffic
2021-12-13 | MV/Manuel Aróstegui | SRE-Data Persistence
2021-12-20 | Reuven Lazarus | SRE-Service Operations
2021-12-27 | No clinic duty this week |
Parameters
Schedule
Folks will follow up with the person on SRE Clinic Duty about existing tasks, as well as how to create new ones.
As the person on clinic duty, you are welcome to join #wikimedia-clinic for assistance while carrying out your shift.
Hand-off / Takeover
Ideally, all Phabricator tasks are replied to or commented on while they are being reviewed and triaged, so no actual handoff of duties is required between weeks.
Update the topic in IRC channel #wikimedia-operations, section 'SRE Clinic Duty:' with the person's name for that week.
The topic on IRC and this page are currently the public facing methods of determining who is on duty.
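As an illustration of how the roster maps a date to a person, here is a minimal Python sketch. It assumes each shift runs for seven days from the Monday listed in the table; `ROSTER` holds only a few rows copied from this page, and `week_start` and `clinician_for` are hypothetical helpers, not existing tools.

```python
from datetime import date, timedelta

# A few rows copied from the roster on this page (week start -> clinician).
# None marks a week with no clinic duty.
ROSTER = {
    date(2021, 6, 28): "Keith Herron",
    date(2021, 7, 5): None,
    date(2021, 7, 12): "Valentín Gutierrez",
}


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def clinician_for(day, roster=ROSTER):
    """Look up who is on clinic duty on `day`.

    Raises KeyError if the week is not present in the roster at all.
    """
    return roster[week_start(day)]


if __name__ == "__main__":
    print(clinician_for(date(2021, 7, 14)))
```

The IRC topic remains the other public source for this information, so a lookup like this is only as current as the table it was copied from.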
Exemptions
Typically this would follow Responsibilities, but it is a much shorter list:
Clinic duty should not triage/escalate/work tasks in the S4 #procurement projects as part of clinic duty.
These involve a lot of communication outside Phabricator with vendors, engineers and finance, and are thus handled by Rob or Willy.
Responsibilities
All incoming Clinic Duty tasks in phabricator can be viewed on the SRE Clinic Duty Dashboard
Review incoming tasks
Maintain the 'ops-maintenance' mails and calendar
[1]: You should have access either through individual membership or inherited permissions from being a member of the "sre" group. If not, ask an existing member to add you, they should have the permissions to do so even if not owner/manager of the group. (Only add other SRE folks). Being a member gives you permissions to do things, it does _not_ necessarily mean you are also receiving emails to your personal inbox. It's entirely up to you whether you like to receive those mails in your personal inbox or just use the web interface while you're on duty.
[2]: Sometimes this doesn't seem to refresh and marked posts are not disappearing from your view immediately. If this happens, removing the filter and applying it again helps.
[3]: If you are not able to create events, ask an SRE to add you (calendar settings => share this calendar).
[4]: You probably want to add the GMT (not daylight) timezone to your calendar (calendar settings => general => add a timezone). In this way you'll be able to specify the correct timezone when creating events for planned maintenance (usually they are announced with UTC dates).
[5]: No action is needed if it's a duplicate/reminder for an event that has already been added to calendar, if it's just an "FYI" kind of mail like "reason for outage", simple spam or anything else that doesn't warrant a calendar entry.
[6]: Copy the important part of the subject line or the summary and use it as the event title. If the mail contains important information like a circuit ID or details on what is affected, paste them into the body part of the calendar event. It's usually good enough to just use "all day" accuracy instead of taking the time to add exact start and end date and converting timezones because we are adding the link back from calendar to the full post with all the details. You don't need to worry about changing subjects or date formats anymore since posts will be sorted by date anyways. You also don't need to reply with a "added to calendar" message anymore and there are no other status changes, just "action needed" or not (done).
[7]: It doesn't matter whether you added it to the calendar or determined it can be skipped; in either case _now_ there is "no action needed" (after you're done). We do it this way and don't use the "completed" status because the way Google Groups works forces you to actually _reply_ to a mail before it can be completed. We don't need that; it would just add unnecessary clicks and mail. Since both "no action needed" and "completed" are just different kinds of "resolution status" and the filter is based on "not resolved", the end result is the same and it is much simpler for us to just use that button.
[8]: WARNING: Jaime realized that marking "no action needed" on the Google Group may also mark later follow-ups on the same thread. While follow-ups are normally reminders, sometimes they are meaningful updates or cancellations. I would recommend reading all new emails in the clinic duty window so as not to miss those updates.
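Notes [4] and [6] say that maintenance windows are usually announced in UTC and that "all day" accuracy is normally enough for the calendar entry. The following Python sketch shows one way to work out which UTC calendar days an announced window touches. `utc_days_covered` is a hypothetical helper written for this page, and the example times are made up.

```python
from datetime import datetime, timedelta, timezone


def utc_days_covered(start, end):
    """Return the UTC calendar dates touched by the window [start, end].

    Both arguments must be timezone-aware datetimes; they are converted
    to UTC before the dates are compared.
    """
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    end_utc = end.astimezone(timezone.utc)
    if end_utc < start_utc:
        raise ValueError("window ends before it starts")
    days = []
    current = start_utc.date()
    while current <= end_utc.date():
        days.append(current)
        current += timedelta(days=1)
    return days


if __name__ == "__main__":
    # Made-up window that crosses midnight UTC.
    window_start = datetime(2021, 3, 1, 23, 0, tzinfo=timezone.utc)
    window_end = datetime(2021, 3, 2, 3, 0, tzinfo=timezone.utc)
    for day in utc_days_covered(window_start, window_end):
        print(day.isoformat())
```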
Be a first contact
Follow up with ticket owners and requestors as needed on old tickets to resolve, re-assign, or escalate as needed.
Read mail to root@
Triage emails sent to root@ (if you don't receive them, you need to add your alias in the private repo). If you see a recurrent issue, please open a sub-task to T132324 and try to notify whoever you think can contribute to the task. Review the outstanding sub-tasks and follow up as needed.
misc
Try to improve the manual below.
Tips
There is a clinic duty dashboard for Phabricator
You can search "to:alerts@wikimedia.org" in gmail to see all things that have paged people, independent of timezones and individual settings. This is used to fill the "pages for awareness"-section in the SRE meeting document.
Manual
This is a manual for the current "SRE on duty" in charge of triaging the Phabricator #SRE project.
How to handle IRC requests
If somebody asks you to do something via IRC and the request is reasonable, politely ask the requestor to turn it into a Phabricator ticket and add the SRE tag to it.
If you suspect the issue could be related to a recent deployment or need further investigation by deployers or developers, on the Phabricator ticket, add the Wikimedia-production-error tag to it.
Common, small "#SRE" tickets
Phabricator Administration
Please note that overall Phabricator administration is handled by Release Engineering. The SRE clinic duty person would typically only get involved if a file needed immediate deletion or a Herald rule was causing chaos.
If an SRE clinic duty person has to login, please do so by accessing the phabricator servers. These have role(phabricator) in site.pp, but are typically phab[12]001.
Once in the system, a login for the admin account can be generated by running:
sudo /srv/phab/phabricator/bin/auth recover admin
The system will output a full URL path containing a one-time login token for the Admin user. You can then navigate to the offending file or Herald rule and delete it via the web UI.
See Phabricator#Administrative Commands for more information.
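The shorthand phab[12]001 used above stands for the two hosts phab1001 and phab2001. For readers unfamiliar with the notation, here is a small Python sketch that expands it. `expand_hosts` is a hypothetical helper; it only understands a single bracket of individual characters and is not the syntax parser of any real tool.

```python
import re


def expand_hosts(pattern):
    """Expand one bracket of single characters, e.g. 'phab[12]001'.

    Ranges such as '[1-3]' are deliberately not supported.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket expression: the pattern is already a single name.
        return [pattern]
    prefix, choices, suffix = match.groups()
    if "-" in choices:
        raise ValueError("ranges are not supported in this sketch")
    return [prefix + choice + suffix for choice in choices]


if __name__ == "__main__":
    print(expand_hosts("phab[12]001"))
```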
Mail aliases
note: SRE handles only role/group mail aliases, individual mail aliases are handled by ITS as outlined here [1]
note2: more recently many aliases have been moved from SRE to ITS, and the goal is definitely NOT to add any new ones on our side unless they are strictly SRE-internal like monitoring etc. you can help by moving even more over to ITS, see T122144
Go to the puppet master (puppetmaster1001), cd to /srv/private/modules/privateexim/files/ in the private repo, usually edit the file wikimedia.org (as root) and sudo git commit. This will create a mail to SRE about the commit, with your username automatically prepended to the commit message.
You can then run puppet on mx1001 and mx2001 to confirm your changes have been applied.
There are 3 types of domains:
a) domains that have their own alias file (wikimedia.org, wikipedia.org and a few others), you will find these files in /srv/private/modules/privateexim/files, just edit them there, sudo git commit, and presto!!!, as with any other change in the private repo.
b) domains that just link to wikimediafoundation.org. These are just symlinks and puppet generates them. If you need to add a new one or change links, go to /srv/private/modules/privateexim/manifests/mail.pp. You will find it in class exim::aliases::private and should be self-explanatory.
c) domains that link to another domain. currently just wikivoyage.de to .org, same as in b) but a separate definition in the puppet class.
It is nice to add the corresponding Phab ticket number in a comment near changed aliases. Experience shows that it can be quite handy to be able to quickly answer questions like when exactly something has been changed and who requested it. There is one file or symlink per domain name. 95% of the time the requests are just regarding the "wikimedia.org" file. In other cases make sure you check for possible symlinks and realize which domains you are actually changing when editing a specific file.
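To make the advice about ticket comments concrete, the sketch below parses a small alias snippet in Python. The file layout shown (one "localpart: target, target" entry per line, with "#" comments) is an assumption based on common exim alias files, not a copy of the private repo; the ticket number, alias name and addresses are all hypothetical, as is the `parse_aliases` helper.

```python
# Hypothetical alias snippet; layout, ticket number and names are assumptions.
SAMPLE = """\
# T000000: example request (hypothetical ticket number)
example-team: alice@example.org, bob@example.org
"""


def parse_aliases(text):
    """Parse 'localpart: target, target' lines into a dict.

    Everything after '#' is treated as a comment and ignored.
    """
    aliases = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        local_part, targets = line.split(":", 1)
        aliases[local_part.strip()] = [
            target.strip() for target in targets.split(",") if target.strip()
        ]
    return aliases


if __name__ == "__main__":
    for local_part, targets in parse_aliases(SAMPLE).items():
        print(local_part, "->", ", ".join(targets))
```

Keeping the ticket reference on the line directly above the alias means a later `git blame` or plain read of the file answers "who asked for this, and when" without searching Phabricator.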
Mailman mailing lists
Public mailing lists should typically be requested through Phabricator tasks tagged with "Wikimedia-Mailing-lists"; Phabricator-maintenance-bot will automatically add the SRE tag. Google mailing lists are managed by ITS. You know it's a Mailman list if its address is @lists.wikimedia.org. To check if an email address exists in Google, you can run "exim4 -bt foo@wikimedia.org" on an MX server.
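The domain rule above is simple enough to express as a quick check. This is only a Python sketch of that rule; `is_mailman_list` is a hypothetical helper and does not query Mailman or Google.

```python
MAILMAN_DOMAIN = "lists.wikimedia.org"


def is_mailman_list(address):
    """Return True if `address` is in the Mailman domain named on this page."""
    local_part, separator, domain = address.strip().rpartition("@")
    return bool(separator and local_part) and domain.lower() == MAILMAN_DOMAIN


if __name__ == "__main__":
    for candidate in ("wikimania-program@lists.wikimedia.org", "foo@wikimedia.org"):
        print(candidate, is_mailman_list(candidate))
```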
create a list
Follow the normal procedure to create a Mailman mailing list.
password reset
Another common task is handling requests for password resets; see the docs on Mailman#Reset_the_admin_password_of_a_list.
disable a list
When you get a request to disable a Mailman list, you just have to run a shell script on the list server; see Mailman#Disable_or_re-enable_a_mailing_list. In addition, it's nice if you log in once using the master password and remove the former admins' email addresses from the "list run by" field.
add/remove owners
From the list server (check puppet to see which host runs lists) you can change owners with the withlist utility. The m.owner attribute contains a list of email addresses, for example for bug T220641:
root@fermium:/var/lib/mailman/bin# ./list_admins wikimania-program
List: wikimania-program, Owners: itait@wikimedia.org
root@fermium:/var/lib/mailman/bin# ./withlist -l wikimania-program
Loading list wikimania-program (locked)
The variable `m' is the wikimania-program MailList instance
>>> m.owner
['itait@wikimedia.org']
>>> m.owner = ['icueva@wikimedia.org']
>>> m.owner
['icueva@wikimedia.org']
>>> m.Save()
>>>
Unlocking (but not saving) list: wikimania-program
Finalizing
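Because m.owner is a plain Python list, the replacement step can be wrapped in a small function that refuses obviously bad input. The sketch below is illustrative only: `replace_owners` and `StandInList` are hypothetical names, the stand-in class exists just so the function can be tried outside withlist, and on a real list you would still need to call m.Save() afterwards as in the session above.

```python
def replace_owners(mlist, new_owners):
    """Set mlist.owner to new_owners and return the previous owners.

    `mlist` is any object with an `owner` list attribute, such as the `m`
    variable inside withlist. Saving the list is left to the caller.
    """
    cleaned = []
    for address in new_owners:
        address = address.strip()
        if "@" not in address:
            raise ValueError("not an email address: %r" % address)
        if address not in cleaned:
            cleaned.append(address)
    if not cleaned:
        raise ValueError("refusing to leave the list without owners")
    previous = list(mlist.owner)
    mlist.owner = cleaned
    return previous


class StandInList(object):
    """Minimal stand-in for a MailList object, for trying the helper out."""

    def __init__(self, owner):
        self.owner = owner


if __name__ == "__main__":
    m = StandInList(["itait@wikimedia.org"])
    print(replace_owners(m, ["icueva@wikimedia.org"]))
    print(m.owner)
```

Returning the previous owners makes it easy to paste the before and after values into the Phabricator task.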
LDAP group changes
Access to a range of mostly web-based services is granted via the "wmf" and "nda" groups. The specific permissions are listed here: https://wikitech.wikimedia.org/wiki/LDAP_Groups The change should be tracked in a ticket.
Create/Get LDAP account
In order to add or update a user's LDAP permissions, they will first need an LDAP account. This can be created by either:
In either case you will need to know the username (for Wikitech) or the shell account name (for Toolforge) that was used. You can also search LDAP to try to find it:
mwmaint1002$ ldapsearch -x mail=user@example.org
Update data.yaml
Check whether there's an existing entry in modules/admin/data/data.yaml:
The entry should look something like the following:
exampleuser:
  ensure: present
  realname: Example User
  email: exampleuser@example.org
  expiry_date: 2038-01-19
  expiry_contact: examplestaff@wikimedia.org
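Before sending the change for review, it can help to sanity-check the new entry. The Python sketch below checks an entry that has already been loaded into a dictionary. It treats the five keys from the example above as expected, which is an assumption rather than a documented rule of the admin module; `check_entry` is a hypothetical helper.

```python
from datetime import date

# Keys taken from the example entry above; whether every one of them is
# mandatory for every user is an assumption made for this sketch.
EXPECTED_KEYS = ("ensure", "realname", "email", "expiry_date", "expiry_contact")


def check_entry(entry, today):
    """Return a list of human-readable problems found in `entry`."""
    problems = ["missing key: " + key for key in EXPECTED_KEYS if key not in entry]
    expiry = entry.get("expiry_date")
    if isinstance(expiry, date) and expiry < today:
        problems.append("expiry_date is in the past")
    return problems


if __name__ == "__main__":
    example = {
        "ensure": "present",
        "realname": "Example User",
        "email": "exampleuser@example.org",
        "expiry_date": date(2038, 1, 19),
        "expiry_contact": "examplestaff@wikimedia.org",
    }
    print(check_entry(example, today=date.today()))
```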
Modify LDAP groups
After having added the user to data.yaml, the change in LDAP can be done from one of the "mediawiki maintenance" hosts like mwmaint1002 (this will be automated in a subsequent step):
TIP: If a user has to be removed from special LDAP access, in most cases (e.g. contract termination) you may also want to notify @aklapper on the same ticket so that Phabricator access can be checked or removed.
For further instructions see Help:Access, LDAP and LDAP Groups.
wmde access
Anyone at Wikimedia Deutschland who wants to get added to the "wmde" LDAP group needs to sign an NDA with the Legal department of the WMF. Simply add @KFrancis to the Phab task and she'll deal with it.
In addition, access to "wmde" needs to be approved by an engineering manager from Wikimedia Deutschland. You can add any of the four to the Phab task:
Access requests
Access and reasoning for requesting it are documented on Requesting shell access. Please read and understand entirely before processing any access requests, as this very brief summary documentation may not cover all required points in the linked page.
If a request asks for things like new shell accounts, access to additional servers, log files, personal data, admin roles in systems like Mailman or Bugzilla, data center access, or opening a firewall rule, then it is an access request and should be moved into the SRE-Access-Requests project. Once the initial request is made, a number of follow-up steps must be confirmed; all of them are included in the /access request checklist.
Analytics Groups
Deployment Groups
The user requires shell access and must have approval from releng to be added to the deployment group.
The user should also be added to the Gerrit group wmf-deployment.
Creating new shell users
Please see instructions in the puppet admin module's README.
Some notable changes since February 2017:
Renaming shell users
Sometimes we have to rename a shell user. This is typically when their shell name doesn't match their login name, and they have issues logging into items requiring LDAP credentials.
Renaming a user requires several steps to happen in a very specific order. Since many users keep data in their home directories, backups can sometimes be made, but not always. (Private data that isn't allowed to be copied off the cluster should not be backed up to laptops.) The existing username has to be removed from the host, since the new username will use the old username's UID.
IRC channel access
/query chanserv
help access
access #channel list
access #channel add *!*@wikimedia/cloak
14:07 -ChanServ(ChanServ@services.)- Flags +Aiortv were set on ...
For people wanting to be a channel operator for #wikimedia-operations, first check that they have nick protection enabled
/msg nickserv info <nick>
...
<nick> has enabled nick protection
and then
/msg chanserv flags #wikimedia-operations <nick> +Aiotv

Check on a Phabricator user
As part of an access request, you might first want to check whether a Phabricator user is actually who they say they are. There is a shell command for this on the Phabricator server; see Phabricator#Check_on_a_Phabricator_user.
Removing access
This typically isn't part of Clinic Duty, but if you need it you can find the relevant steps at SRE_Offboarding#All_Users.
Google search console access
Documented at Google Search Console access. Google search console access is extremely limited compared to access to other services. This is due to the limitations of the service.
Revocations are done manually: at the moment, an entry is added to the main-announcement calendar and requires manual action.
Powercycling / reboots
RT duty paging for reboots is usually due to hardware failure, or immediate concerns of exploits. Anything outside those issues would be handled by normal operations workflow, and would not necessarily fall to the RT triage duty person.
Powercycling requires a passing familiarity with the different out-of-band management options we use (these depend on the vendor). The hardware type can be determined by looking up the hardware in question in Racktables; you can then find the instructions in Platform-specific_documentation.
Last edited on 23 July 2021, at 14:27
Content is available under CC BY-SA 3.0 unless otherwise noted.