Dumps/Rerunning a job


Fixing a broken dump

Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again and then there was a new one.

These days folks want the data Right Now, and some dumps take a good long time to run (*cough*wikidata*cough*). If you see a broken or failed dump, this is how you can fix it up.

Rerunning dumps on all wikis

If the host crashes while the dump scheduler is running, the status files are left as-is, and the display shows any dumps that were running on any wikis as still running, until the monitor node decides the lock files for those wikis are stale enough to mark them as aborted.

To restart the scheduler from where it left off:

Really, you can just wait for systemd to pick it up; it checks twice a day for aborted runs, unless the job has fallen outside of the run date range. You can check the date range by looking at the appropriate systemd timer entry for fulldumps.sh on any snapshot host.
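For example, a quick sketch of finding that timer (the exact unit name varies, so fill it in from the first command's output):

  systemctl list-timers --all | grep -i fulldumps     # find the fulldumps timer and its next run
  systemctl cat <fulldumps-timer-unit>                # show the unit, including the date range passed to fulldumps.sh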

If you're outside the range, just do this:

  1. be on each appropriate host as root
  2. start a screen session
  3. su - dumpsgen
  4. bash fulldumps.sh rundate 'today' regular|enwiki|wikidatawiki full|partial 28|20
  1. rundate should be replaced with the day of the month the run started, in DD format. Example: the first monthly run would use '01' to run on the first of the month.
  2. today should be replaced with today's day of the month in DD format
  3. 'regular' means that everything except enwiki and wikidatawiki will be dumped, 'enwiki' means only enwiki will be dumped, 'wikidatawiki' means only wikidatawiki will be dumped
  4. full means that page content history will be dumped along with everything else; 'partial' means this will be skipped
  5. 28 should be used on snapshot hosts dedicated to xml/sql dumps. Don't run the script anywhere else. This number is the maximum number of processes to be dedicated to the run on the host.

Example: bash fulldumps.sh 01 17 regular full 28

If the worker script encounters more than three failed dumps in a row it will exit; this avoids generation of piles of broken dumps which later would need to be cleaned up. Once the underlying problem is fixed, you can go to the screen session of the host running those wikis and rerun the previous command.
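To get back to that screen session, a quick sketch:

  screen -ls            # list screen sessions on the host
  screen -r <session>   # reattach to the one used for the dumps; detach again with ctrl-a d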

Rerunning a complete dump

If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing. If the date is within the date range for the cron job for the specific dump type (full = with history, partial = without content history), just wait for the entire dump to be rerun. You can check this by looking at the appropriate entry in the crontab for the dumpsgen user on a snapshot host that runs the wiki types you want (enwiki, wikidatawiki, or regular for all the rest). Have a look at hiera/hosts/snapshotXXX to see which snapshot hosts run which dumps; if the type isn't listed there it's 'regular'.
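For example, a quick way to check on a snapshot host (entry and unit names vary, so adjust the grep as needed):

  sudo crontab -u dumpsgen -l | grep -i dump    # cron entries for the dumpsgen user, if any
  systemctl list-timers | grep -i dump          # or the equivalent systemd timers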

Otherwise, follow the steps below.

  1. Be root on any snapshot testbed host.
  2. start a screen session (these dumps take a while to run).
  3. su - dumpsgen
  4. cd /srv/deployment/dumps/dumps/xmldumps-backup
  5. determine which config file argument the wiki uses: enwiki uses /etc/dumps/confs/wikidump.conf.dumps:en, wikidatawiki uses /etc/dumps/confs/wikidump.conf.dumps:wd. "big" wikis (see the list in dblists.pp, look for the definition of "$bigwikis") use /etc/dumps/confs/wikidump.conf.dumps:bigwikis. The rest use /etc/dumps/confs/wikidump.conf.dumps with no extra parameter.
  6. Make sure any process dumping the wiki in question has stopped:
    • determine which worker host is or was running the job, looking for the file /mnt/dumpsdata/xmldatadumps/private/wiki-name-here/lock-somenumber.txt
    • if that file exists, check its contents for the name of the snapshot host running the job, and the pid
    • be on that host
    • python3 dumpadmin.py --kill --wiki <wikiname here> --configfile confs/<config-file-arg here>
  7. Double-check that the processes are gone; pkill them if you have to.
  8. Clean up the lock file left behind, if any, either by removing the file directly or by doing:
    • python3 dumpadmin.py --unlock --wiki <wikiname here> --configfile confs/<config-file-arg here>
  9. Verify that the lockfile /mnt/dumpsdata/xmldatadumps/private/wiki-name-here/lock-somenumber.txt is gone
  10. On the testbed host, rerun the entire dump. Steps already completed properly will be skipped.
    • If it's the full run with history, do this:
    bash ./worker --date last --skipdone --exclusive --log --configfile <config-file-arg here> --wiki <wikiname-here>
    • If it's not the full history, but the abbreviated run, do this:
    bash ./worker --date last --skipdone --exclusive --log --configfile <config-file-arg here> --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine --wiki <wikiname-here>

NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You should wait for the new dump to complete, and at that time do the above, replacing --date last with --date date_of_run, where date_of_run is the same as the name of the dump directory, a date in YYYYMMDD format.
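Putting the steps above together, a minimal sketch for a hypothetical elwiktionary rerun of the full (history) dump; the wiki, hosts and config argument are illustrative, and the configfile argument uses the absolute path form from the other examples on this page:

  # on the snapshot host named in the lock file, as dumpsgen:
  cd /srv/deployment/dumps/dumps/xmldumps-backup
  python3 dumpadmin.py --kill --wiki elwiktionary --configfile /etc/dumps/confs/wikidump.conf.dumps
  python3 dumpadmin.py --unlock --wiki elwiktionary --configfile /etc/dumps/confs/wikidump.conf.dumps
  # then on the testbed host, also as dumpsgen and in the same directory:
  bash ./worker --date last --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --wiki elwiktionary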

Rerunning one piece of a dump

ONLY do this if the dump script is already running another job for that wiki (and hence locked) and you really really have to have that output Right Now.

  1. As above, you'll need to determine the date, which configuration file argument you need, and which host to run from.
  2. You don't need to do anything about lockfiles.
  3. Determine which job (which step) needs to be re-run. Presumably the failed step has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/), in which case it should be marked as status:failed in the dumpruninfo.txt file in the run directory (see the grep sketch after this list). Use the job name from that file, and remember that the jobs are listed in reverse order of execution. If you were told by a user or aren't sure which job is the one, see Dumps/Phases of a dump run to figure out the right job(s).
  4. If there's already a root screen session on the host, use it; otherwise start a new one. Open a window, then:
    • su - dumpsgen
    • cd /srv/deployment/dumps/dumps/xmldumps-backup
    • bash ./worker --job job-name-you-found --date YYYYmmdd --configfile /etc/dumps/confs/wikidump.conf.dumps:XXX --log --wiki name-of-wiki
    The date in the above will be the date in the directory name and on the dump web page for that wiki.
    Example: to rerun the generation of the bzip2 pages meta history file for the enwiki dumps for January 2012 you would run
    bash ./worker --job metahistorybz2dump --date 20120104 --configfile /etc/dumps/confs/wikidump.conf.dumps:en --log --wiki enwiki
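For step 3 above, a quick way to spot the failed job(s) is to grep the run's dumpruninfo.txt; a sketch, assuming the run directory layout used elsewhere on this page:

  grep 'status:failed' /mnt/dumpsdata/xmldatadumps/public/enwiki/20120104/dumpruninfo.txt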

Rerunning an interrupted en wikipedia history dump

Clean up the underlying issue and wait for systemd to pick it up.

If systemd does not pick it up because the dump is running too late in the month and would not finish in time before the next run, you can run it by hand as explained above in 'rerunning a complete dump' and the rest will take care of itself. HOWEVER you may be delaying the next dump run. It's your funeral!

Rerunning a dump from a given step onwards

ONLY do this if the output for these jobs is corrupt and needs to be regenerated. Otherwise follow the instructions to rerun a complete dump, which will simply rerun steps with missing output.

Do as described in 'Rerunning one piece of a dump' but add '--cleanup', '--exclusive', and '--restart' args before the wikiname.

You must not do this while a dump run for the wiki is in progress.
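As a sketch, the example command from the previous section with those arguments added (wiki and date are illustrative):

  bash ./worker --job metahistorybz2dump --date 20120104 --configfile /etc/dumps/confs/wikidump.conf.dumps:en --log --cleanup --exclusive --restart --wiki enwiki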

Rerunning a step without using the python scripts

ONLY FOR DEBUGGING.

Sometimes you may want to rerun a step using mysql or the MediaWiki maintenance scripts directly, especially if the particular step causes problems more than once.

In order to see what job was run by the worker.py script, you can either look at the log (dumplog.txt) or you can run the step from worker.py giving the "dryrun" option, which tells it "don't actually do this, write to stderr the commands that would be run".

  1. Determine which host the wiki is dumped from, which configuration file is used, the date of the dump and the job name, as described in the section above about rerunning one piece of a dump.
  2. Give the appropriate worker.py command, as in that same section, adding the option "--dryrun" before the name of the wiki.

Examples

First be root on a testbed snapshot host.

  • su - dumpsgen
  • cd /srv/deployment/dumps/dumps/xmldumps-backup
  • To see how the category table gets dumped, type:
    python3 ./worker.py --date 20190401 --job categorytable --configfile /etc/dumps/confs/wikidump.conf.dumps --dryrun elwiktionary
    to get the output
    Command to run: /usr/bin/mysqldump -h 10.64.48.35 -u XXX -pXXX --max_allowed_packet=32M --opt --quick --skip-add-locks --skip-lock-tables elwiktionary category | /bin/gzip > /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-category.sql.gz.inprog
  • To see how the stub xml files get dumped, type:
    python3 ./worker.py --date 20190401 --job xmlstubsdump --configfile /etc/dumps/confs/wikidump.conf.dumps --dryrun elwiktionary
    to get the output
    Command to run: /usr/bin/python3 xmlstubs.py --config /etc/dumps/confs/wikidump.conf.dumps --wiki elwiktionary --articles /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-articles.xml.gz.inprog --history /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-history.xml.gz.inprog --current /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-current.xml.gz.inprog --filter=namespace:!NS_USER
    As you see from the above, all three stub files are written at the same time.
    The xmlstubs.py script calls a MediaWiki maintenance script. To see how that is called, FOR DEBUGGING/TESTS ONLY, type:
    /usr/bin/python3 xmlstubs.py --config /etc/dumps/confs/wikidump.conf.dumps --wiki elwiktionary --articles /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-articles.xml.gz.inprog --history /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-history.xml.gz.inprog --current /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-current.xml.gz.inprog --dryrun
    to get the output
    would run command: /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=elwiktionary --full --stub --report=1000 --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/elwiktionary/elwiktionary-20190401-stub-meta-history.xml.gz.inprog_tmp --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/elwiktionary/elwiktionary-20190401-stub-meta-current.xml.gz.inprog_tmp --filter=latest --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/elwiktionary/elwiktionary-20190401-stub-articles.xml.gz.inprog_tmp --filter=latest --filter=notalk --filter=namespace:!NS_USER --skip-footer --start=1 --end=1
    would run command: /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=elwiktionary --full --stub --report=1000 --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/elwiktionary/elwiktionary-20190401-stub-meta-history.xml.gz.inprog_tmp --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/elwiktionary/elwiktionary-20190401-stub-meta-current.xml.gz.inprog_tmp --filter=latest --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/elwiktionary/elwiktionary-20190401-stub-articles.xml.gz.inprog_tmp --filter=latest --filter=notalk --filter=namespace:!NS_USER --skip-header --start=1 --skip-footer --end 5001
    And so on. If you are doing your own tests you can change the page range and adjust the inclusion of MediaWiki XML headers and footers at will.
  • To see how the full history xml bzipped file is dumped, type:
    python3 ./worker.py --date 20190401 --job metahistorybz2dump --configfile /etc/dumps/confs/wikidump.conf.dumps --dryrun elwiktionary
    to get the output
    Command to run: /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-history.xml.gz --prefetch=7zip:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190301/elwiktionary-20190301-pages-meta-history.xml.7z --report=1000 --spawn=/usr/bin/php7.2 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-pages-meta-history.xml.bz2.inprog --full
    Note that if the xml stubs job has not run for the date you specify, the script will refuse to run the meta history dumps, even in dryrun mode, and it will tell you so.
    Don't be surprised if you see that it would prefetch from a file more recent than the dump you are fixing up. If the more recent dump is marked as successful, the script will try to do just that, which may be unexpected behaviour but should give you good output... unless you suspect the more recent dump of having bad data. In that case you should see the sections about prefetch below.

General notes about the above commands:

  • On a large wiki, you will specify --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis (or perhaps :en or :wd) and all jobs except the tables will have several processes running at once, with numbered output files.
  • Output is written into file(s) with the extension '.inprog', and rsync will skip these so they don't show up on our web server, for example.
  • Output of some jobs goes into files in a temporary directory; this is true of the stubs files which, in spite of the inner extension '.gz', are plain text. They are fed to gzip as they are created, and that file is written into the final output directory. When you do testing you may specify gzip:/path/to/file.gz instead of file:/path/to/temp/file.
  • Tests/debugging should never write to the public directory. Generate files in e.g. /mnt/dumpsdata/temp/dumpsgen/some-subdir and clean up or move them into place when you are done. You can use wikidump.conf.tests instead of wikidump.conf.dumps for the python or bash wrapper scripts; this will cause input to be read from the tree under /mnt/dumpsdata/temp/dumpsgen/ and output to be written to the same tree. If you are invoking the bash wrapper or python scripts rather than the MW maintenance scripts directly, and the job you are rerunning requires stub files or previous page content files, you will need to copy them into the new location (see the sketch after this list).
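A minimal sketch of that last note; the wiki, date and job are illustrative, the path of wikidump.conf.tests is assumed to sit alongside the other config files, and the layout under /mnt/dumpsdata/temp/dumpsgen/ is assumed to mirror the production tree, so check it before copying anything:

  # copy the stub file the job needs into the test tree (hypothetical layout)
  mkdir -p /mnt/dumpsdata/temp/dumpsgen/xmldatadumps/public/elwiktionary/20190401
  cp /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-history.xml.gz /mnt/dumpsdata/temp/dumpsgen/xmldatadumps/public/elwiktionary/20190401/
  # rerun the job against the test config so output lands under the temp tree as well
  python3 ./worker.py --date 20190401 --job metahistorybz2dump --configfile /etc/dumps/confs/wikidump.conf.tests elwiktionary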

Generating new dumps

When new wikis are enabled on the site, they are added to all.dblist which is checked by the dump scripts. They get dumped as soon as a worker completes a run already in progress, so you don't have to do anything special for them.

Text revision files

A few notes about the generation of the files containing the text revisions of each page.

Stub files as prerequisite

You need to have the "stub" XML files generated first. These get done much faster than the text dumps. For example, generating the stubs files for en wikipedia without doing multiple pieces at a time took less than a day in early 2010, but generating the full history file without parallel runs took over a month, and today it would take much longer.

While you can specify a range of pages to the script that generates the stubs, there is no such option for generating the revision text files. The revision ids in the stub file used as input determine which revisions are written as output.

Prefetch from previous dumps

In order to save time and wear and tear on the database servers, old data is reused to the extent possible; the production scripts run with a "prefetch" option which reads revision texts from a previous dump and, if they pass a basic sanity check, writes them out instead of polling the database for them. Thus, only new or restored revisions in the database should be requested by the script.

Using a different prefetch file for revision texts

Sometimes the file used for prefetch may be broken or the XML parser may balk at it for whatever reason. You can deal with this in two ways.

  1. You could mark the file as bad, by going into the dump directory for the date the prefetch file was generated and editing the file dumpruninfo.txt, changing "status:done;" to "status:bad;" for the dump job (one of articlesdump, metacurrentdump or metahistorybz2dump), and then rerunning the step using the python script worker.py (see the sketch after this list).
  2. You could run the step by hand without the python script (see the section above on how to do that), specifying prefetch from another, earlier file or set of files. Example: to regenerate the elwiktionary history file from 20190401 with a prefetch from the 20190201 output instead of the 20190301 files, type:
    /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-history.xml.gz --prefetch=7zip:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190201/elwiktionary-20190201-pages-meta-history.xml.7z --report=1000 --spawn=/usr/bin/php7.2 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-pages-meta-history.xml.bz2.inprog --full
  3. Don't forget to move the output file into place (/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-pages-meta-history.xml.bz2) when it's complete.
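For option 1, a sketch of the edit; this assumes the dumpruninfo.txt lines look like 'name:<jobname>; status:done; ...', so check the file first (wiki and date are illustrative):

  # mark the 20190301 elwiktionary history job as bad so it is not used for prefetch
  sed -i 's/name:metahistorybz2dump; status:done;/name:metahistorybz2dump; status:bad;/' /mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190301/dumpruninfo.txt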

We don't keep many old dump runs on the dumpsdata hosts (which provide the filesystem mounted at /mnt/dumpsdata), so you can't go back more than two months for prefetch files. In a pinch you could grab older ones off the web server via rsync and put them in a temporary directory on the dumpsdata host; then adjust the path to the file in the prefetch argument to the script.

Skipping prefetch for revision texts

Sometimes you may not trust the contents of the previous dumps, or you may not have them at all. You can run without prefetch, but it is much slower, so avoid this if possible for larger wikis. If you must, do one of the following:

  1. run the worker.py script with the option --noprefetch (see the sketch after this list)
  2. run the step by hand without the python script (see the section above on how to do that), removing the prefetch option from the command. Example: to regenerate the elwiktionary history file from 20190401 without prefetch, you would type:
    /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-stub-meta-history.xml.gz --report=1000 --spawn=/usr/bin/php7.2 --output=bzip2:/mnt/dumpsdata/xmldatadumps/public/elwiktionary/20190401/elwiktionary-20190401-pages-meta-history.xml.bz2.inprog --full
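For option 1, a minimal sketch using the worker.py syntax from the examples earlier on this page (wiki and date are illustrative):

  python3 ./worker.py --date 20190401 --job metahistorybz2dump --noprefetch --configfile /etc/dumps/confs/wikidump.conf.dumps elwiktionary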