Wikipedia Preprocessor (WikiPrep)

Developed and maintained by Evgeniy Gabrilovich (gabr@cs.technion.ac.il)
  1. Overview
  2. Description
  3. Conditions of use
  4. Support
  5. References

News

The code is slowly being moved to SourceForge.net, where it will be hosted as project WikiPrep. Stay tuned!

Overview

Wikipedia is a terrific knowledge resource, and many recent studies in artificial intelligence, information retrieval, and related fields have used Wikipedia to endow computers with (some) human knowledge. Wikipedia dumps are publicly available in XML format, but they have a few shortcomings. On the one hand, they contain a lot of information that is rarely needed when Wikipedia texts are used as a source of knowledge (e.g., the ids of users who changed each article and the timestamps of article modifications). On the other hand, the XML dumps omit a lot of useful information that can be inferred from them, such as link tables, the category hierarchy, and the resolution of redirection links.
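
As an illustration of one such derived piece of information, the short Perl sketch below extracts internal link targets and their anchor texts from wiki markup and follows redirect chains to their final targets. It is only an illustration, not the actual WikiPrep code, and the redirect map and sample text in it are made up for the example.

  #!/usr/bin/perl
  # Illustrative sketch only -- not the actual WikiPrep code.
  # Extracts internal link targets from wiki markup and resolves them through
  # a (made-up) redirect map, guarding against redirect cycles.
  use strict;
  use warnings;

  # Hypothetical redirect map: redirect title -> target title.
  my %redirect = (
      'Big Blue'        => 'IBM',
      'IBM Corporation' => 'IBM',
  );

  # Follow redirect chains until a non-redirect title is reached
  # (or a cycle is detected).
  sub resolve_redirect {
      my ($title) = @_;
      my %seen;
      while (exists $redirect{$title} && !$seen{$title}++) {
          $title = $redirect{$title};
      }
      return $title;
  }

  my $wikitext = 'In 2005, [[Big Blue|IBM]] sold its PC business to [[Lenovo]].';

  # Internal links look like [[Target]] or [[Target|anchor text]].
  while ($wikitext =~ /\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]/g) {
      my ($target, $anchor) = ($1, defined $2 ? $2 : $1);
      printf "anchor '%s' -> article '%s'\n", $anchor, resolve_redirect($target);
  }

On real data, the redirect map itself would first be built from the #REDIRECT pages found in the dump.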

In the course of my Ph.D. work, I developed a fairly extensive preprocessor that converts the standard Wikipedia XML dump into my own extended XML format, eliminating some of the raw information and adding other useful information.
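
To make the idea of such preprocessing concrete, here is a minimal sketch (far simpler than WikiPrep itself) that streams a dump from standard input and keeps only page titles and wiki text, discarding revision metadata such as contributor ids and timestamps; it assumes the tag layout of the standard pages-articles dumps.

  #!/usr/bin/perl
  # Minimal sketch only -- far simpler than WikiPrep itself.
  # Reads a MediaWiki XML dump on standard input and prints page titles and
  # wiki text, dropping revision metadata (contributor ids, timestamps, etc.).
  # Empty pages with self-closing <text .../> tags are not handled.
  use strict;
  use warnings;

  my ($title, $in_text, $text) = ('', 0, '');

  while (my $line = <STDIN>) {
      if ($line =~ m{<title>(.*?)</title>}) {
          $title = $1;
      }
      if (!$in_text && $line =~ s{.*?<text[^>]*>}{}s) {
          $in_text = 1;                     # the <text> element may span many lines
      }
      if ($in_text) {
          if ($line =~ s{</text>.*}{}s) {   # end of the wiki text for this page
              $text .= $line;
              print "== $title ==\n$text\n\n";
              ($in_text, $text) = (0, '');
          } else {
              $text .= $line;
          }
      }
  }

Run it as perl strip_dump.pl < XXX.xml, where strip_dump.pl is whatever name you save the sketch under.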

Description

WikiPrep is a single Perl script, which can be downloaded here.

Prerequisites

To run the script, you will need to have the following Perl modules installed on your system:

Usage

perl wikiprep.pl -f <XML-dump-file>

If the input file is named XXX.xml, then the following files will be produced:

What else is available

Note: Of course, you should always strive to use the latest Wikipedia snapshot; unfortunately, I do not have enough storage (or bandwidth) capacity to provide you with preprocessed versions of the latest dumps. The script has been tested on the snapshot dated July 19, 2007, and produced about 9 GB of output files (not counting the log file of over 40 GB).

Detailed description

The preprocessor script accomplishes the following tasks:

Running time

Wikipedia dumps are huge, so preprocessing them takes time. Just to give you an idea of what to expect (on a dual-core ~2 GHz Intel computer):

Further reading

The following pages will help you better understand Wikipedia markup and data encoding:

Conditions of use

This software is distributed under the terms of the GNU General Public License version 2. The software is provided on an "AS IS" basis, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

If you publish results based on this code, please cite the following papers:

Please also inform your readers of the current location of the software: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep

Support

Well, it's free, so don't expect too much support :) I will likely be able to answer simple questions, but not complex programming questions (please refer to your local Perl guru). I do not promise to correct bugs, but I will try to do my best, especially if you suggest specific ways to correct the bug you encountered (in which case your contribution will, of course, be acknowledged).

References

  1. Evgeniy Gabrilovich and Shaul Markovitch
    "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis"
    Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007
    [Abstract / PDF]

  2. Evgeniy Gabrilovich and Shaul Markovitch
    "Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge"
    Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), pp. 1301-1306, Boston, July 2006
    [Abstract / PDF]

  3. Evgeniy Gabrilovich
    "Feature Generation for Textual Information Retrieval Using World Knowledge"
    PhD Thesis, Technion - Israel Institute of Technology, Haifa, Israel, December 2006
    [Abstract / PDF]


Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on November 2, 2010