Mirroring Wikimedia project XML dumps/estimates

From Meta, a Wikimedia project coordination wiki

September 2023 estimate

1 full dump

In /mnt/nfs/dumps-clouddumps1001.wikimedia.org/, I ran:

  • du -sc *wik*/20230901
    8911036880 kbytes, about 9 TB (8.3T in the binary units used for the older estimates below)
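The du total above can be read in decimal or binary units, which is why it looks larger than the "T" figures in the older estimates. A minimal sketch of the conversion, assuming du reported its default 1024-byte blocks (GNU du behaviour; this is an assumption, the page does not say):

```shell
#!/bin/sh
# Total from "du -sc", assumed to be in 1 KiB (1024-byte) blocks.
kib=8911036880

# Decimal terabytes: bytes / 10^12
tb=$(awk -v k="$kib" 'BEGIN { printf "%.1f", k * 1024 / 1e12 }')
# Binary tebibytes: KiB / 1024^3 (the unit du -h labels "T")
tib=$(awk -v k="$kib" 'BEGIN { printf "%.1f", k / (1024 * 1024 * 1024) }')

echo "$kib KiB = $tb TB (decimal) = $tib TiB (binary)"
```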

June 2019 estimate

From labstore1007.wikimedia.org:/srv/dumps/xmldatabases/public, I ran:

  • du -sc *wik*/20190401
    5399908744 kbytes, about 5.0T - last full dump
  • du -sc *wik*/20190320 *wik*/20190401
    7050667472 kbytes, about 6.5T - last 2 complete dumps, one full and one partial
  • cat rsync-filelist-last-4-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last4-2019.txt
    14273771168 kbytes, about 13.3T
  • cat rsync-filelist-last-5-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last5-2019.txt
    17478319256 kbytes, about 16.3T

March 2014 estimate

From database1001:/data/xmldatabases/public, I ran:

  • cat rsync-filelist-last-1-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last1-2014.txt
    2293020248 kbytes, about 2.1T
  • cat rsync-filelist-last-2-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last2-2014.txt
    4566567252 kbytes, about 4.2T
  • cat rsync-filelist-last-4-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last4-2014.txt
    9058836524 kbytes, about 8.4T
  • cat rsync-filelist-last-5-good.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /root/space-needs-precise-last5-2014.txt
    11275812556 kbytes, about 10.5T

August 5 2013 estimate

I wrote a tiny script that gives me the du across all projects for the first dump of each month, to track growth. Using that I have the following (du totals, in kbytes):

total for year 2012, month 07: 1,024,149,808
total for year 2012, month 08: 1,470,942,004
total for year 2012, month 09: 1,493,108,864
total for year 2012, month 10: 1,689,314,104
total for year 2012, month 11: 1,758,897,860
total for year 2012, month 12: 1,790,053,612
total for year 2013, month 01: 1,813,492,404
total for year 2013, month 02: 1,845,837,948
total for year 2013, month 03: 2,010,964,172
total for year 2013, month 04: 1,737,291,644
total for year 2013, month 05: 1,935,464,408
total for year 2013, month 07: 2,003,609,136

Rather scary!
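The script itself is not shown on this page. A minimal sketch of what such a script might look like, assuming the <project>/<YYYYMMDD> directory layout implied by the du commands elsewhere on this page; the function name first_dump_total, its arguments and the example path are hypothetical:

```shell
#!/bin/sh
# Hypothetical sketch: sum du over the first dump of a given month for every project.
# Assumes a layout of <root>/<project>/<YYYYMMDD>/ (not confirmed by the page).
# Note: plain sh has no local variables, so root, yearmonth, project and dump are global.

first_dump_total() {
    root=$1        # top-level dumps directory
    yearmonth=$2   # e.g. 201307
    set --         # reuse "$@" as the list of dump directories to measure
    for project in "$root"/*wik*/; do
        # Glob results sort lexicographically, so the first match is the earliest date.
        for dump in "$project$yearmonth"[0-9][0-9]/; do
            if [ -d "$dump" ]; then
                set -- "$@" "$dump"
            fi
            break
        done
    done
    if [ "$#" -eq 0 ]; then
        echo 0
        return
    fi
    # -k forces 1 KiB units; the last line of "du -c" output is the grand total.
    du -sck "$@" | tail -n 1 | cut -f1
}

# Example use (the path is a placeholder):
#   first_dump_total /path/to/dumps/public 201307
```

Calling the function once per month and printing the result would reproduce a table like the one above. Projects with no dump in the requested month are simply skipped.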

Using the method described under the Jan 21 2012 estimate below, the totals across all wikis are 9 803 791 956 kbytes (9.2T) for the last 5 good dumps, 4 004 194 960 kbytes (3.8T) for the last 2 good dumps, and 2 031 829 840 kbytes (1.9T) for the last 1 good dump.

Jan 21 2012 estimate

last 5

I have a list of the last 5 complete dumps for each project; we generate it for rsyncing to mirror sites. This list includes 5 full complete dumps of the beast, enwiki. Running

cat rsync-list.txt | sed -e 's/^/./g;' | tr '\n' '\0' | du -sc --files0-from=- > /data/xmldatadumps/atgtesting/space-needs-precise-2012.txt

gave me a total of 6347870404 kbytes, or 6.0T in human-readable form.

last 2

As above, but starting from a list of the last 2 good dumps and running du on that file list, we got 2598845584 kbytes, or 2.5T in human-readable form.

last 1

I generated a list of the files in the last good dump across all projects, using our rsync list generation script. From that and a similar du to the above, the space used is 1308697212 kbytes, or 1.3T in human-readable form.

Dec. 16 2010 estimate

The source of the 1.3T estimate is as follows:

I ran a simple du script on our copy of the dumps. It skipped "bad" and "archive" dumps (known to be incomplete or corrupt) and only looked at the dumps that completed. It might have counted the most recent dump for a project even if some items in it failed; this shouldn't have cost us much accuracy.

Original total: 648 235 316 K = 648 GB.

For enwiki it used the 20100730 dumps, which have a total of 85 072 676 K = 85 GB. It should have used the most complete run, 20100904, which totals 419 439 240 K = 419 GB. The difference in size is 334 366 564 K = 334 GB.

Adding that to our previous total we now get 982 601 880 K = 983 GB.

Two more factors: we did not run the 7z compression on the 20100904 dumps, which would give us about 35 GB more. We also did not recombine the page-meta-history bz2 files; this would have given us 343 663 832 874 bytes = 344 GB more. Adding those to 983 GB we get a grand whopping total of 344 + 35 + 983 = 1362 GB.

(Someone can check my arithmetic, I suck.)
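The arithmetic above holds up. A minimal sketch of the check, using the figures quoted in the text and, like the text, treating 1 K as 1000 bytes when rounding to GB (the variable names are made up for illustration):

```shell
#!/bin/sh
# Figures in K, as quoted in the text above.
original_total=648235316
enwiki_used=85072676       # enwiki 20100730, the run the script picked
enwiki_wanted=419439240    # enwiki 20100904, the most complete run

difference=$((enwiki_wanted - enwiki_used))
corrected_total=$((original_total + difference))
echo "difference:      $difference K"
echo "corrected total: $corrected_total K"

# Rounded GB figures from the text: corrected total + 7z estimate + recombined bz2s.
grand_total_gb=$((983 + 35 + 344))
echo "grand total:     $grand_total_gb GB"
```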