Data dumps/Other tools
There are three ways to use the compressed data dumps: decompress them, which is time- and memory-consuming; read the compressed files with a general-purpose library, e.g. Python's Bz2file; or use one of the custom Wikipedia readers/libraries listed below.
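As a sketch of the second option, Python's standard bz2 module can stream a compressed dump while xml.etree.ElementTree.iterparse keeps memory use flat. The sample data and filename below are invented for illustration; a real dump (e.g. pages-articles.xml.bz2) would be substituted for the generated sample file.

```python
import bz2
import xml.etree.ElementTree as ET

# Stand-in for a real dump: a tiny two-page extract in MediaWiki export
# format, compressed the same way the published .bz2 dumps are.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><revision><text>first page</text></revision></page>
  <page><title>Beta</title><revision><text>second page</text></revision></page>
</mediawiki>"""
with bz2.open("sample.xml.bz2", "wt", encoding="utf-8") as f:
    f.write(SAMPLE)

# Stream the compressed file without ever fully decompressing it to disk.
titles = []
with bz2.open("sample.xml.bz2", "rt", encoding="utf-8") as f:
    for event, elem in ET.iterparse(f):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the namespace, which varies by dump version
        if tag == "title":
            titles.append(elem.text)
        elif tag == "page":
            elem.clear()  # free finished pages so memory stays flat
print(titles)  # ['Alpha', 'Beta']
```

Because iterparse yields elements as their closing tags arrive and finished pages are cleared immediately, this pattern handles multi-gigabyte dumps with roughly constant memory.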
Other tools
WikiXRay Python parser
WikiXRay is a Python tool for automatically processing Wikipedia's XML dumps for research purposes.
It also includes a more complete parser that extracts metadata for all revisions and pages in a Wikipedia XML dump compressed with 7-Zip (or in any other compressed format). See the WikiXRay page on Meta for more info.
WikiPrep Perl script
Wikipedia preprocessor (wikiprep.pl) is a Perl script that preprocesses raw XML dumps: it builds link tables and category hierarchies, collects anchor text for each article, etc. Also of interest is the newer SourceForge page with more up-to-date branches: wikiprep.sf.net
The version described above works only on older dumps (several years old) and is no longer maintained; it WILL break on current dumps. The idea has not been abandoned, though: the script is now maintained by Tomaz Solc under the GPL and is available here. This version spawns multiple processes if required to speed up processing.
Wikipedia Dump Reader
This program provides a convenient user interface for reading the text-only compressed XML dumps.
No conversion is needed, only an initial index-construction step. It is written mostly in Python+Qt4, except for a small, very portable bzip2-decompression C module, so it should run on any PyQt4-enabled platform, although it has been tested only on desktop Linux. Wikicode is reinterpreted, so pages may sometimes display differently than in the official PHP renderer.
https://launchpad.net/wikipediadumpreader
MediaWiki XML Processing
This Python library is a collection of utilities for efficiently processing MediaWiki's XML database dumps. It addresses two important concerns: performance and the complexity of streaming XML parsing.
https://pythonhosted.org/mwxml/
MediaWiki SQL Processing
This Python library is a collection of utilities for efficiently processing MediaWiki's SQL database dumps. It is designed to be very similar to mwxml, but for the SQL dumps.
https://pypi.org/project/mwsql/
BzReader (Windows offline reader)
This program allows Windows users to read Wikipedia offline using the compressed dumps.
It has fast built-in full-text search, and wiki code is rendered as HTML. You can also navigate between articles just as in the online Wikipedia.
https://code.google.com/archive/p/bzreader/downloads
bzip2
For the .bz2 files, use bzip2 to decompress. bzip2 comes standard with most Linux, Unix, and Mac OS X systems. On Windows you may need to obtain it separately from the link below.
http://www.bzip.org/downloads.html
mwdumper can read the .bz2 files directly, but importDump.php requires piping like so: bzip2 -dc pages_current.xml.bz2 | php importDump.php
7-Zip
For the .7z files, you can use 7-Zip or p7zip to decompress. Both are available as free software.
Something like 7za e -so pages_current.xml.7z | php importDump.php will expand the current pages and pipe them to the importDump.php PHP script.
Even more tools
BigDump - a small PHP script for importing very large MySQL dumps (even through web servers with hard runtime limits or safe mode enabled!)
And still more...
A number of offline readers of Wikipedia have been developed.
A list of alternative parsers and related tools is available for perusal. Some of these are downloaders, some parse the XML dumps, and some convert the wikitext of a single page into rendered HTML.
See also this list of data processing tools intended for use with the Wikimedia XML dumps.

Last edited on 13 August 2021, at 15:39
Content is available under CC BY-SA 3.0 unless otherwise noted.