Nova Resource:Dumps/Documentation

Revision as of 00:29, 17 January 2015

Dumps

Description

This is a project that archives the public datasets generated by Wikimedia.

Purpose

Archive the public Wikimedia datasets.

Anticipated time span

indefinite

Project status

currently running

Contact address

https://groups.google.com/forum/#!forum/wikiteam-discuss

Willing to take contributors or not

not willing

Subject area narrow or broad

broad

Project information

Introduction

This project was created to provide a dedicated space just for transferring Wikimedia dump files to the Internet Archive. These dumps were created as a possible backup in the case of cluster-wide hardware failure, and they are also often used by researchers and bots. Sometimes these files are also used to fork a Wikimedia project, for example when many contributors of a project have aims that differ from the original Wikimedia goal.

Data currently being archived

Here is some information, with links, about the data that this project is archiving:

  • Wikimedia main database dumps
  • Wikimedia incremental dumps
  • Wikidata JSON dumps
  • Wikimania videos
  • OpenStreetMap datasets

Servers

  • dumps-N (where N is an integer): Main archiving servers
  • dumps-stats: Wikimedia data manipulation, including dumps above and other stuff of relevance for Wikimedia research.

Storage:

  • Before the eqiad migration we used to have a 900 GB quota (hardly sufficient for comfortable work).
  • Currently all heavy operations are conducted on /data/scratch/. We keep to a soft limit of 3 TB of space, and such disk usage is always temporary: the data is deleted once it has been pushed to the Archive.
  • Everything is retained locally only for very short periods, just the time needed for packing on archive.org.

Code

The source code for all the files used in this project is available on GitHub. This code might eventually find its way into the Wikimedia Gerrit repository, but there are no plans to do so right now.

More information is available at Dumps/Archive.org.

Caveats

  • We watch for collateral damage on Ganglia where possible.
  • Never assume that some data is safe. If we didn't archive something and you can archive it before us, do so! Upload it to archive.org and we'll notice it and fill any gaps (a quick existence check is sketched below).
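
For example, before spending bandwidth on a gap yourself, you can check whether an identifier is already on archive.org. This is only a minimal sketch, assuming the Python internetarchive library is installed and configured; the identifier is the example one from the Metadata section below.

  # Check whether an item already exists on archive.org before filling a gap.
  # Assumption: the "internetarchive" library is installed and configured
  # (pip install internetarchive; ia configure).
  import internetarchive as ia

  identifier = "incr-idwiktionary-20150115"  # example identifier from the Metadata section
  item = ia.get_item(identifier)
  if item.exists:
      print(identifier, "is already on archive.org")
  else:
      print(identifier, "is missing; feel free to archive it yourself")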

Internet Archive tips

  • It's fine to upload to the "opensource" collection with the "wikiteam" keyword and let the collection admins among us sort it out later.
  • All new archival code should use the IA library: https://pypi.python.org/pypi/internetarchive (a minimal upload sketch follows this list).
  • On duplication: first of all, be thankful for the Internet Archive's generosity and efficiency with little funding. Second, SketchCow> [...] uploading stuff of dubious value or duplication to archive.org: [...] gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad.
    • For instance, it's probably pointless to archive two copies of the same XML, one compressed in 7z and one in bz2. Just archive the 7z copy; fast consumption that needs bzcat and the like can rely on the original site.
  • As of summer 2014, upload is one or two orders of magnitude faster than it used to be. It's not uncommon to reach 350 Mb/s upstream to s3.us.archive.org.
  • Ask more on #wikiteam or #internetarchive at EFNet for informal chat, or on the archive.org forums for discoverability.
  • Especially when the files are huge, remember to disable automatic derive: it creates data transfer for no gain.
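
To make these tips concrete, here is a minimal upload sketch using the internetarchive library, with automatic derive disabled. The local filename is a hypothetical example, and the metadata shown is abbreviated; full conventions are in the Metadata section below.

  # Minimal upload sketch. Assumptions: the "internetarchive" library is
  # installed and configured ("ia configure"), and the local filename below
  # is only a hypothetical example.
  import internetarchive as ia

  ia.upload(
      "incr-idwiktionary-20150115",      # identifier, per the templates below
      files=["idwiktionary-20150115-pages-meta-hist-incr.xml.bz2"],  # hypothetical filename
      metadata={
          "collection": "opensource",    # collection admins can re-sort later
          "subject": "wikiteam; wiki; MediaWiki; Wikimedia projects; dumps",
      },
      queue_derive=False,                # disable automatic derive for huge files
      retries=5,                         # retry on temporary S3 errors
  )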

Metadata

As for metadata, it's important to keep it correct and consistent, if not rich, so that things are easy to find, bulk download and link.

  • For pageview stats we follow this template: https://archive.org/details/wikipedia_visitor_stats_201001
  • For incremental dumps something like this: https://archive.org/details/incr-idwiktionary-20150115
  • For regular dumps (usually $wikiid-$timestamp as the identifier) it's important to be precise (see the sketch after this list):
    • <description>Database backup dump provided by the Wikimedia Foundation: a complete copy of the wiki content, in the form of wikitext source and metadata embedded in XML.</description>
    • <licenseurl>https://creativecommons.org/licenses/by-sa/3.0/</licenseurl>
    • <contributor>Wikimedia Foundation</contributor>
    • <subject>wiki; MediaWiki; Wikimedia projects; dumps</subject>
    • <rights>https://dumps.wikimedia.org/legal.html</rights>
    • Optional:
      • <creator>Wikimedia projects editors</creator>
      • <subject>data dumps; idwiktionary; Wiktionary</subject> (the database name, like "idwiktionary" or "fiwiki", is easy; the project name is not obvious when the database name ends with "wiki", e.g. commonswiki is Wikimedia Commons).
      • <language>id</language> (same caveats as above and even more)
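
As an illustration only (not the project's actual upload script), the same fields can be passed to the internetarchive library as a metadata dictionary; the wiki, timestamp and filename here are hypothetical examples.

  # Sketch of the metadata for a regular dump item ($wikiid-$timestamp identifier).
  # The wiki, timestamp and filename are hypothetical examples.
  import internetarchive as ia

  wikiid, timestamp = "idwiktionary", "20150115"
  metadata = {
      "description": ("Database backup dump provided by the Wikimedia Foundation: "
                      "a complete copy of the wiki content, in the form of wikitext "
                      "source and metadata embedded in XML."),
      "licenseurl": "https://creativecommons.org/licenses/by-sa/3.0/",
      "contributor": "Wikimedia Foundation",
      "subject": "wiki; MediaWiki; Wikimedia projects; dumps",
      "rights": "https://dumps.wikimedia.org/legal.html",
      # Optional fields:
      "creator": "Wikimedia projects editors",
      "language": "id",
  }

  ia.upload(wikiid + "-" + timestamp,
            files=[wikiid + "-" + timestamp + "-pages-meta-history.xml.7z"],  # hypothetical filename
            metadata=metadata,
            queue_derive=False)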

Links