Release Engineering/Drafts/Deployments/How to deploy the train: Difference between revisions
Revision as of 20:50, 9 April 2021
This page is currently a draft. More information and discussion about changes to this draft can be found on the talk page.
Initial setup
If this is your first time running the train, you need to do some initial configuration. Start by SSHing into deploy1002.eqiad.wmnet.
First, clone the mediawiki/tools/release repo into your home directory:
USERNAME@deploy1002:~$ git clone https://gerrit.wikimedia.org/r/mediawiki/tools/release
Next, make sure you're using a full-featured Git prompt for Bash by adding the following to your ~/.profile:
GIT_PS1_SHOWUNTRACKEDFILES=1
GIT_PS1_SHOWDIRTYSTATE=1
GIT_PS1_SHOWUPSTREAM="auto verbose"
. /etc/bash_completion.d/git-prompt
PS1='\u@\h \w$(__git_ps1 " (%s)") \$ '
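A slightly more defensive variant of the snippet above, as a sketch: it only sets the Git-aware prompt if the prompt script is readable, so a login on a host without that file does not print an error. The helper name load_git_prompt and its optional path argument are made up for this illustration; only the GIT_PS1_* variables, the script path, and the PS1 value come from the instructions above.

```shell
# Hypothetical helper: source the Git prompt script only if it is readable.
# The default path is the one used in the instructions above.
load_git_prompt() {
    local script="${1:-/etc/bash_completion.d/git-prompt}"
    if [ -r "$script" ]; then
        . "$script"
        return 0
    fi
    return 1
}

GIT_PS1_SHOWUNTRACKEDFILES=1
GIT_PS1_SHOWDIRTYSTATE=1
GIT_PS1_SHOWUPSTREAM="auto verbose"

# Only switch to the Git-aware prompt when __git_ps1 is actually available.
if load_git_prompt; then
    PS1='\u@\h \w$(__git_ps1 " (%s)") \$ '
fi
```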
Pairing on the Train
There are two people assigned to each week's train: One as primary, and one as backup. Please see Release Engineering/Drafts/Deployments/How to pair on the train for an overview of helpful practices.
Breakage
There will be times when this process does not go smoothly. There are guidelines for what to do when that happens.
In general, if an unexplained error occurs within 1 hour of a train deployment, always roll back the train. Rolling back eliminates the train as a possible cause, which is especially important when several other changes are in flight and any of them could be the source of the problem.
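The one-hour rule can be expressed as a small check. This is only a sketch: the function name should_roll_back and its timestamp arguments are hypothetical, and treating exactly 3600 seconds as still "within 1 hour" is an assumption.

```shell
# Return success (0) if an error was first seen within one hour after a deploy.
# Both arguments are Unix timestamps in seconds. The function name and its
# interface are hypothetical, chosen only for this illustration.
should_roll_back() {
    local deploy_epoch="$1" error_epoch="$2"
    local elapsed=$(( error_epoch - deploy_epoch ))
    [ "$elapsed" -ge 0 ] && [ "$elapsed" -le 3600 ]
}

# Example: train synced at 19:00:00 UTC, error first seen 42 minutes later.
deploy=1617994800   # 2021-04-09 19:00:00 UTC
error=1617997320    # 2021-04-09 19:42:00 UTC
if should_roll_back "$deploy" "$error"; then
    echo "within 1 hour of the train: roll back"
fi
```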
Rollback
Rolling back wikiversions changes should be quick. Roll back production before you send patches to Gerrit, since waiting on CI may take a while:
USERNAME@deploy1002:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
USERNAME@deploy1002:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'
USERNAME@deploy1002:/srv/mediawiki-staging$ # Now that you've synced the revert, push the patch to Gerrit. Run git commit --amend first so the commit gets a Change-Id:
USERNAME@deploy1002:/srv/mediawiki-staging$ git commit --amend
USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2
Example:
USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=1.34.0-wmf.0,l=Code-Review+2
- Wait for the patch to merge and then fetch back down to the deployment server
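To avoid typos while under pressure, the two version-dependent strings from the rollback steps can be generated rather than typed. The helper names below are hypothetical; the string formats are copied from the commands above.

```shell
# Hypothetical helpers that assemble the two strings used in the rollback
# steps above, so the version is typed only once.

# Gerrit push target with a topic and a self-applied Code-Review+2.
rollback_push_ref() {
    printf 'HEAD:refs/for/master%%topic=%s,l=Code-Review+2\n' "$1"
}

# Log message for scap sync-wikiversions; the first argument is the group,
# e.g. "group0" or "group1".
rollback_sync_message() {
    printf 'Revert "%s wikis to %s"\n' "$1" "$2"
}

version="1.34.0-wmf.0"
rollback_push_ref "$version"
rollback_sync_message group1 "$version"
```

The first helper's output would be passed to git push origin, and the second helper's output to scap sync-wikiversions, exactly as in the steps above.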
Places to Watch for Breakage
Train deployers should check for breakage as they are rolling out the train, as they are effectively the first line of defense for train deploys. Some of the places to watch for breakage:
- IRC
- Primary channel is #wikimedia-operations
- Useful channels are #mediawiki-core and #wikimedia-dev
- For more channels see MediaWiki on IRC and IRC/Channels
- Logs
- mwlog1001:
- logspam-watch
- logfiles in /srv/mw-log
- Logstash MediaWiki Errors
- Logstash "mediawiki-new-errors" dashboard (linked from logstash front page)
- Showing only timeout errors (see T204871)
- Group-specific Logstash Dashboards:
- See the Wikimedia-production-error workboard for known issues
- Logstash mw-client-errors dashboard
- New errors appearing more than 1000 times in a 12 hour period should be considered blockers
- See also Grafana dashboard with summary of average error rate over time
- Grafana
- Varnish error-rate dashboard (HTTP 5XX % should have at least three zeros after the decimal point, e.g. 0.0001%)
- Frontend Responses NGINX vs Varnish
- Production Logging
- Minerva Client Errors - Browser JS errors count (only wikipedias on mobile)
- Application Servers RED Dashboard
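The two numeric rules of thumb in the list above (more than 1000 occurrences of a new error in 12 hours, and a 5XX percentage with at least three zeros after the decimal point) can be written as simple checks. This is a sketch with hypothetical function names; it does not query Logstash or Grafana, it only compares numbers you have read off the dashboards.

```shell
# Hypothetical checks for the two numeric rules of thumb in the list above.

# A new error seen more than 1000 times in a 12 hour period is a blocker.
is_blocker_error_count() {
    [ "$1" -gt 1000 ]
}

# The HTTP 5XX percentage should have at least three zeros after the decimal
# point, which is the same as being below 0.001 (percent).
is_5xx_rate_healthy() {
    awk -v rate="$1" 'BEGIN { exit !(rate + 0 < 0.001) }'
}

is_blocker_error_count 1500 && echo "1500 occurrences: blocker"
is_5xx_rate_healthy 0.0001 && echo "0.0001%: healthy"
is_5xx_rate_healthy 0.02 || echo "0.02%: investigate"
```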
If the train is blocked
- A task will be assigned to you, for example T191059 (1.32.0-wmf.13 deployment blockers) (you can see that week's task at https://train-blockers.toolforge.org)
- Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.
Checklist
If there are blocking tasks, please do the following:
- Make sure all tasks blocking the train are set to Unbreak Now! priority in Phabricator
- Comment on the task asking for an ETA, or whether this can be solved by reverting a recent commit.
- Send e-mail to:
- ops@lists.wikimedia.org
- wikitech-l@lists.wikimedia.org
- Ping private #engineering-all Slack channel
- Subject: [Train] {version} status update: {brief summary}
- Body:

The {version} version of MediaWiki is blocked[0].

The new version is deployed to {group(s){0,1,2}}[1], but can proceed no further until these issues are resolved:

* {Phab task name} - {phab task link}

Once these issues are resolved train can resume. If these issues are resolved on a Friday the train will resume Monday.

Thank you for your help resolving these issues!

-- Your humble train toiler

[0]. <{link to phab task for train}>
[1]. <https://versions.toolforge.org/>

- Tag relevant teams and people (see Developers/Maintainers) on the blocking task
- Ping relevant people in IRC
- Once train is unblocked be sure to thank the folks who helped unblock it
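If you send these status updates often, the subject line from the checklist can be filled in with a one-line helper. The function name is hypothetical and the summary text in the example is only an illustration.

```shell
# Hypothetical helper: fill in the subject line template from the checklist.
train_status_subject() {
    local version="$1" summary="$2"
    printf '[Train] %s status update: %s\n' "$version" "$summary"
}

# The summary here is an invented example, not a real status.
train_status_subject "1.32.0-wmf.13" "blocked at group0"
```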
Weekly steps
Monday: Sync up with your deployment partner
The primary train conductor will be the assignee of the train blocker task in Phabricator. Backup conductor will be listed as Backup Conductor at the top of the task. On Monday, you should communicate briefly with your partner and establish how you'll collaborate over the course of the week. See Release Engineering/Drafts/Deployments/How to pair on the train for an overview of helpful practices.
Tuesday: New branch creation and deploy
Before the deploy window
Depending on how practiced you are and where you choose to run commands (full clones of mediawiki-core from outside the cluster can take a while), the steps will typically take 45 to 90 minutes.
- Short-form instructions
Step | host | command | example | |
---|---|---|---|---|
P-0 | Verify branch cut job worked | Your laptop | The branch cut runs in a periodic jenkins job that runs on Tuesdays at 02:00 UTC on the releases-jenkins instance. Navigate to gerrit to find the branch commit that the job created.
If there are no open commits shown in gerrit using the link above, you can troubleshoot via the releases-jenkins job.
| |
P-1 | Note the MW core commit from which you've just created the branch | IRC
( |
!log [VERSION] was branched at [BRANCH POINT] for [TASK]
|
!log 1.35.0-wmf.14 was branched at fb16374c5bdb9d14729f358fb81638fc91640b4f for T233862
|
P-2 | Merge the branch commit | Gerrit (example) | C+2 on the patch. It takes about 25 minutes for the branch to be tested and merged. | |
P-3 | Enter screen (or tmux if you prefer)
|
deploy1002.eqiad.wmnet | USERNAME@deploy1002:~$ screen -D -RR train
| |
P-4 | Set local ssh-agent in session | deploy1002 | USERNAME@deploy1002:~$ eval $(ssh-agent)
USERNAME@deploy1002:~$ ssh-add .ssh/id_ed25519
| |
P-5 | Clone new branch in production (once the branch commit from P-2 has landed) | deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging$ scap prep [VERSION]
|
USERNAME@deploy1002:/srv/mediawiki-staging$ scap prep 1.34.0-wmf.0
|
P-6 | Apply security patches | deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging$ scap apply-patches --train [VERSION]
| |
P-7 | Create and auto-merge/deploy the testwikis patch
|
deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote testwikis [VERSION]
Promote testwikis from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
| |
P-8 | Verify version change on testwiki | testwiki | Verify version change on testwiki (Installed software, Product: MediaWiki, Version: [VERSION]) and l10n cache (Special:Version should not look like Special:Version?uselang=qqx). This is done automatically by the deploy script, but is worth verifying manually. | |
P-9 | Decide what old stuff to prune | deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging/$ find . -mindepth 2 -maxdepth 2 -type f -path './php-*/README.md' -ctime +7 -exec dirname {} \;
| |
P-10 | Clean up old stuff
Note: this step runs a scap sync of the directory, and can take 30 minutes. |
deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging/$ scap clean --delete [some old version from find -ctime +7 output above]
|
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap clean --delete 1.34.0-wmf.0
|
Wait for the deploy window |
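The !log line from step P-1 is easy to get wrong by hand. The sketch below builds it from its three parts and rejects versions that do not look like the examples on this page. The function name is hypothetical, and the version pattern is an assumption inferred from examples such as 1.35.0-wmf.14.

```shell
# Hypothetical helper: build the !log line for step P-1.
# Arguments: version, commit hash of the branch point, Phabricator task ID.
branch_log_line() {
    local version="$1" commit="$2" task="$3"
    # Assumed version shape, inferred from the examples on this page.
    if ! printf '%s\n' "$version" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+-wmf\.[0-9]+$'; then
        echo "unexpected version format: $version" >&2
        return 1
    fi
    printf '!log %s was branched at %s for %s\n' "$version" "$commit" "$task"
}

branch_log_line 1.35.0-wmf.14 fb16374c5bdb9d14729f358fb81638fc91640b4f T233862
```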
During the deploy window
- Short-form instructions
Step | host | command | example | |
---|---|---|---|---|
0-0 | Create and auto-merge/deploy the group0 patch | deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote group0
Promote group0 from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
| |
0-1 | Verify production has indeed switched | MediaWiki.org | Verify that mediawikiwiki has switched to the new version (Installed software, Product: MediaWiki, Version: VERSION) | |
0-2 | Monitor production logs | Logstash etc. | Monitor IRC and Logstash and/or logspam-watch for problems; see #Places to Watch for Breakage | |
0-3 | Update roadmap page | mw:MediaWiki 1.36/Roadmap | Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 0 (deployed to group0)
|
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|0}}
|
0-4 | Kill ssh-agent | deployment server | USERNAME@deploy1002:~$ pgrep -u "$USER" -laf ssh-agent # list all of your ssh-agent processes
USERNAME@deploy1002:~$ pkill -u "$USER" -f ssh-agent # kill all your ssh-agent processes
USERNAME@deploy1002:~$ pgrep -u "$USER" -laf ssh-agent # did they all die?
|
Wednesday: group0 to group1 deploy
- Meta / coordination
Attend the Train Log Triage meeting with members of the Core Platform Team and others.
- Short-form instructions
Step | host | command | example | |
---|---|---|---|---|
1-0 | Create and auto-merge/deploy the group1 patch | deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote
Promote group1 from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
| |
1-1 | Verify production has indeed switched | English Wiktionary | Verify that the English Wiktionary (and other group1 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION) | |
1-2 | Monitor production logs | Logstash etc. | Monitor IRC and Logstash and/or logspam-watch for problems; see #Places to Watch for Breakage | |
1-3 | Update roadmap page | mw:MediaWiki 1.36/Roadmap | Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 1 (deployed to group1)
|
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|1}}
...
{{WMFReleaseTableFooter}}
|
Thursday: group{0,1} to all deploy
- Short-form instructions
Step | host | command | example | |
---|---|---|---|---|
2-0 | Create and auto-merge/deploy the group2 patch | deploy1002 | USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote all
Promote all from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
| |
2-1 | Verify production has indeed switched | English Wikipedia | Verify that the English Wikipedia (and other group2 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION) | |
2-2 | Monitor production logs | Logstash etc. | Monitor IRC and Logstash and/or logspam-watch for problems; see #Places to Watch for Breakage | |
2-3 | Update roadmap page | mw:MediaWiki 1.36/Roadmap | Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 2 (deployed to all)
|
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|2}}
...
{{WMFReleaseTableFooter}}
|
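The roadmap update in steps 0-3, 1-3 and 2-3 differs only in the third template parameter. As a sketch, the mapping from group to parameter value can be written down once. The function name is hypothetical, and the first two arguments are simply passed through as they appear in the existing row.

```shell
# Hypothetical helper: render the WMFReleaseTableRow line for the roadmap.
# Arguments: first and second template parameters as they already appear in
# the row, then the group the train has reached (group0, group1 or all).
roadmap_row() {
    local wmf="$1" date="$2" group="$3" stage
    case "$group" in
        group0) stage=0 ;;   # deployed to group0
        group1) stage=1 ;;   # deployed to group1
        all)    stage=2 ;;   # deployed to all
        *) echo "unknown group: $group" >&2; return 1 ;;
    esac
    printf '{{WMFReleaseTableRow|%s|%s|%s}}\n' "$wmf" "$date" "$stage"
}

roadmap_row 12 2018-07-10 group1
```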
Incident documentation
- If there were problems during the train, follow instructions at Incident documentation on incident reports and post-mortem review.
- Use the Create report form to create a new page, train-[VERSION]. Example: Incident documentation/20181212-Train-1.33.0-wmf.8.
- For the Timeline section, events from SAL and the Phabricator task are a good start.
See also
- For the current versions deployed to the various wikis, see https://versions.toolforge.org/
Footnotes
- If you need to leave in the middle, you can press ctrl-a d to detach and run screen -r train to reattach.