Dumps/XML-SQL Dumps
We want mirrors! For more information see Dumps/Mirror status.
Docs for end-users of the xml/sql dumps can be found on meta. If you're a Toolforge user and want to use the dumps, check out Help:Shared storage for information on where to find the files.
For a list of various information sources about the dumps, see Dumps/Other information sources.
The following info is for folks who hack on, maintain and administer the dumps and the dump servers.
Setup
Current architecture
Rather than bore you with that here, see Dumps/Current Architecture.
Current hosts
For which hosts are serving data, see Dumps/Dump servers. For which hosts are generating dumps, see Dumps/Snapshot hosts. For which hosts are providing space via NFS for the generated dumps, see Dumps/Dumpsdata hosts.
Adding a new snapshot host
Install and add to site.pp in the snapshot stanza (see snapshot1005-9). Add the relevant hiera entries, documented in site.pp, according to whether the server will run enwiki or wikidatawiki xml/sql dumps (only one server should do so for each of these huge wikis), or misc cron jobs (one host should do so, and it should not run xml/sql dumps).
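The constraints above (one host each for enwiki, wikidatawiki and the misc cron jobs, with the misc host running no xml/sql dumps) can be sanity-checked before a change is merged. The following is only a sketch: the role labels `enwiki`, `wikidatawiki`, `misc_cron` and `regular`, the host names, and the reading of "only one" as "exactly one" are assumptions made for illustration. The real hiera keys are the ones documented in site.pp.

```python
from collections import Counter

# Hypothetical role labels; the real hiera entries are documented in site.pp.
EXCLUSIVE_ROLES = ("enwiki", "wikidatawiki", "misc_cron")


def check_snapshot_roles(assignments):
    """Return a list of problems found in a host -> set-of-roles mapping."""
    problems = []
    counts = Counter(role for roles in assignments.values() for role in roles)
    # Each of these roles is assumed to belong to exactly one host.
    for role in EXCLUSIVE_ROLES:
        if counts[role] != 1:
            problems.append(f"{role}: expected exactly 1 host, found {counts[role]}")
    # The misc cron host should not also run xml/sql dumps.
    for host, roles in sorted(assignments.items()):
        if "misc_cron" in roles and len(roles) > 1:
            problems.append(f"{host}: misc_cron host must not run xml/sql dumps")
    return problems
```

An empty list means the mapping satisfies the constraints as read here; anything else is a human-readable description of what to fix.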
Dumps run out of /srv/deployment/dumps/dumps/xmldumps-backup on each server. Deployment is done via scap3 from the deployment server.
Starting dump runs
Do nothing. These jobs run out of cron.
Troubleshooting
Fixing code
The python dumps scripts are all in the operations/dumps.git repo, branch 'master'. Various supporting scripts that are not part of the dumps proper are in puppet; you can find those in the snapshot module.
The python dump scripts rely on a number of C utilities for manipulating MediaWiki xml files and/or bzip2-compressed files. These can be found in the operations/dumps/mwbzutils repo.
Getting a copy of the python scripts as a committer:
git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git
cd dumps
git checkout master
To deploy a change, ssh to the deployment host, then:
  1. cd /srv/deployment/dumps/dumps
  2. git pull
  3. scap deploy
Note: you likely need to be in the ops LDAP group to do the scap. Also note that changes pushed will not take effect until the next dump run; any current run uses the existing dump code to complete.
Fixing configuration files
Configuration file setup is handled in the snapshot puppet module. You can check the config files themselves at /etc/dumps/confs on any snapshot host.
Out of space
See Dumps/Dumpsdata hosts#Space issues if we are running out of space on the hosts where the dumps are written as generated.
See Dumps/Dump servers#Space issues if we are running out of space on the dumps web or rsync servers.
Rsync slow
Symptoms: dumps output files take a long time to show up on the web server or the internal NFS mounts server.
If rsync from the xmldumps primary NFS server to one of the public-facing hosts seems to take a lot longer than from the secondary NFS server, after adjusting for the amount of data to be copied, check load and memory use on the primary NFS server as compared to the secondary. In the past, an update to a cron job made that job take much longer than usual to complete, so that many copies of the script stacked up, all running at the same time. Little should be running there; if you see many copies of the same python script, shoot them all and open a Phabricator task so that the problem can be fixed.
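Spotting stacked-up copies of one script by eye in a long process list is error-prone, so a small helper can do the counting. This is a sketch, not an existing tool: the function names and the threshold of three copies are assumptions, and it relies only on `ps -eo args=` (procps) printing one full command line per process with no header.

```python
import subprocess
from collections import Counter


def stacked_commands(command_lines, threshold=3):
    """Return {command line: count} for commands running at least `threshold` times."""
    counts = Counter(line.strip() for line in command_lines if line.strip())
    return {cmd: n for cmd, n in counts.items() if n >= threshold}


def running_python_commands():
    """List the full command lines of running processes that mention python."""
    result = subprocess.run(
        ["ps", "-eo", "args="], capture_output=True, text=True, check=True
    )
    return [line for line in result.stdout.splitlines() if "python" in line]
```

On the NFS server, `stacked_commands(running_python_commands())` would list any python command line that appears three or more times. It only reports; killing the processes and opening the task remain manual steps.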
Broken dumps
The dumps can break in a few interesting ways.
They no longer appear to be running. Is the monitor running? See below. If it is running, perhaps all the workers are stuck on a stage waiting for a previous stage that failed.
Shoot them all and let the cron job sort it out. You can also look at the error notifications section and see if anything turns up; fix the underlying problem and wait for cron.
A dump for a particular wiki has been aborted. This may be due to me shooting the script because it was behaving badly, or because a host was powercycled in the middle of a run.
The next cron job should fix this up.
A dump on a particular wiki has failed.
Check the information on error notifications, track down the underlying issue (db outage? MW deploy of bad code? Other?), fix it, and wait for cron to rerun it.
A dump has hung on some step, the processes in the pipeline apparently reading/writing and yet no output being produced.
We get email notifications to ops-dumps@wikimedia.org if there is a lockfile for a wiki and no file updated within the last 4 hours. These must be investigated on a case by case basis.
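The hung-dump condition described above (a lockfile is present, yet nothing has been written for 4 hours) can be checked by hand when investigating. The sketch below is not the code that sends the notification emails; the function names are made up, and the lockfile and output directory locations must be supplied by the caller because this page does not give them.

```python
import os
import time


def newest_mtime(directory):
    """Return the latest modification time of any file under `directory`, or None."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def looks_hung(lockfile, output_dir, max_idle_hours=4, now=None):
    """True if the lockfile exists and no output file changed within the window."""
    if not os.path.exists(lockfile):
        # No lock means no run is claimed to be in progress for this wiki.
        return False
    now = time.time() if now is None else now
    newest = newest_mtime(output_dir)
    # An empty output directory under a lock also counts as no progress.
    return newest is None or now - newest > max_idle_hours * 3600
```

A `True` result only says the run is worth a look; as noted above, each case still has to be investigated individually.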
Error notifications
Email is ordinarily sent if a dump does not complete successfully, going to ops-dumps@wikimedia.org which is an alias. If you want to follow and fix failures, add yourself to that alias.
Logs are kept of each run. From any snapshot host, you can find the log for a run at /mnt/data/xmldatadumps/private/<wikiname>/<date>/dumplog.txt. From these you may glean more reasons for the failure.
Logs that capture the rest are available in /var/log/dumps/ and may also contain clues.
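When a run's log is long, a quick filter helps find where things went wrong. The sketch below builds the per-run log path from the layout given above and yields lines that look like failures. The keyword list is a guess (this page does not describe the log format), and the `20220401` date format in the example is an assumption.

```python
from pathlib import Path

# Base directory taken from the log path given above.
PRIVATE_ROOT = Path("/mnt/data/xmldatadumps/private")


def dumplog_path(wiki, date, root=PRIVATE_ROOT):
    """Build the per-run log path for a wiki name and a run date."""
    return Path(root) / wiki / date / "dumplog.txt"


def suspicious_lines(log_path, keywords=("error", "fail", "exception")):
    """Yield (line number, text) for lines containing any keyword, ignoring case."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            lowered = line.lower()
            if any(word in lowered for word in keywords):
                yield number, line.rstrip("\n")
```

For example, `for n, text in suspicious_lines(dumplog_path("enwiki", "20220401")): print(n, text)` prints candidate lines with their line numbers so they can be read in context.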
When one or more steps of a dump fail, the index.html file for that dump includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.
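The rule that a failed step only holds back the steps consuming its output can be made concrete with a small dependency walk. This illustrates the rule only and is not the dumps scheduler; the step names in the usage note are hypothetical.

```python
def blocked_steps(depends_on, failed):
    """Return the steps that cannot run because they depend, directly or
    transitively, on a failed step. `depends_on` maps step -> list of inputs."""
    blocked = set()
    changed = True
    # Repeat until no new step becomes blocked (handles chains of dependencies).
    while changed:
        changed = False
        for step, inputs in depends_on.items():
            if step in blocked or step in failed:
                continue
            if any(dep in failed or dep in blocked for dep in inputs):
                blocked.add(step)
                changed = True
    return blocked
```

With `deps = {"stubs": [], "articles": ["stubs"], "tables": []}`, a failure of `stubs` blocks `articles` while `tables` still runs.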
Monitoring is broken
If the monitor does not appear to be running (the index.html file showing the dumps status is never updated), check which host should have it running (look for the host with profile::dumps::generation::worker::monitor in the role, at this writing snapshot1007). The monitor is a service that should be restarted by systemd or upstart, depending on the OS version, so if it is down you'll want to see what change broke it.
Rerunning dumps
You really really don't want to do this. These jobs run out of cron. All by themselves. Trust me. Once the underlying problem (bad MW code, unhappy db server, out of space, etc) is fixed, it will get taken care of.
Okay, you don't trust me, or something's really broken. See Dumps/Rerunning a job if you absolutely have to rerun a wiki/job.
A dump server (snapshot host) dies
If it can be brought back up within a day, don't bother to take any measures, just get the box back in service. If there are deployments scheduled in the meantime, you may want to remove it from scap targets for mediawiki: edit hieradata/common/scap/dsh.yaml for that.
If it's the testbed host (check the role in site.pp), just leave everything alone; no services will be impacted.
If it will take more than a day to be fixed, swap it for the testbed/canary box, and remove it from scap targets for mediawiki (edit hieradata/common/scap/dsh.yaml for that).
A dumpsdata host dies
Coming soon... but in the meantime see Dumps/XML-SQL Dumps/Swapping NFS servers which explains the steps for swapping the primary and fallback xml/sql NFS servers when they are both operational.
A dumpsdata host has NFS issues
Maybe icinga alerted, or maybe you noticed that the dumps snapshot hosts have extra high load and that there are NFS timeouts in their syslogs. First, check the obvious; is the array full? Is the box so loaded that something OOMed? Is there anything bizarre in the syslog or other logs?
Assuming you see nothing unusual, and nfsd is still running:
We do a lot of disk I/O on these NFS-mounted filesystems; multiple dumps jobs running in parallel on multiple hosts, plus an rsync to copy off data to the fallback dumpsdata host and the labstore boxes, could be more than the disks can handle. Check disk utilization and IOPS and see what's going on. Narrow spikes of 100% utilization are normal; sustained periods at 100% are not. If that's the problem, check if there was an rsync going when the alert was triggered; if so, you can try being more aggressive with rsync bandwidth caps (look for the BWLIMIT setting).
We applied a class in the past to adjust vm.min_free_kbytes for hosts with 16GB RAM that were also providing web service at the time. These settings have not been altered for the current dumpsdata hosts which have 32GB RAM and do only NFS for dumps generation, and rsync to internal peers; perhaps they should be. There's an open ticket for the dumps public-facing servers ([1]) but not for the dumpsdata boxes.
Some of the past history of issues on the old dump NFS servers can be found in this phab task.
Note that NFS cache use has been a problem in the past with data consistency, so we have actimeo=0 on the clients. (See Phab task.) This could be revisited.
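For the disk utilization check mentioned above, it can help to separate narrow spikes from sustained saturation in a series of samples, such as per-interval utilization percentages copied from a monitoring graph. This is a rough sketch: the 99% threshold and the two-sample spike allowance are assumptions, not values from the dumps setup.

```python
def longest_saturated_run(samples, threshold=99.0):
    """Return the length of the longest consecutive run of samples >= threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest


def is_sustained(samples, max_spike_samples=2, threshold=99.0):
    """True if saturation lasts longer than what is tolerated as a narrow spike."""
    return longest_saturated_run(samples, threshold) > max_spike_samples
```

If `is_sustained` returns true for the window around the alert, that points toward the disks being overloaded, at which point checking for a concurrent rsync is the next step.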
A labstore host dies (web or nfs server for dumps)
These are managed by Wikimedia Cloud Services. Should this situation arise, someone on that team should carry out the procedure below.
At current writing there are two labstore boxes that we care about; one serves web to the public + NFS to stats hosts; the other serves NFS to cloud VPS instances/toolforge.
Notes on NFS issues and Toolforge load
This may be untrue now. WMCS believes we have found and corrected the root cause of the rising load issue when dumps servers are off line. This has yet to be tested, though.
Both hosts' NFS filesystems are mounted on all hosts that use either server for NFS, and the clients determine which nfs filesystem to use based on a symlink that varies from cluster to cluster. The dumps_dist_active_web setting only affects the symlink to the NFS filesystem on the stats hosts. Likewise, the dumps_dist_active_vps only affects the symlink to NFS filesystem on the VPSes (including Toolforge).
If the server is the vps NFS server (the value of dumps_dist_active_vps), Toolforge is probably losing its mind by now. The best that can be done is to remove it from dumps_dist_nfs_servers, change dumps_dist_active_vps to the working server, and unmount that NFS share everywhere you possibly can. The earlier this is done, the better. Load will be climbing like mad on any Cloud VPS server, including Toolforge nodes, the entire time. Unmounting everything may or may not stop that.
Last edited on 22 April 2022, at 22:45
Content is available under CC BY-SA 3.0 unless otherwise noted.