Nova Resource:Dumps/Documentation

* Description: This is a project that archives the public datasets generated by Wikimedia.
* Purpose: Archive the public Wikimedia datasets.
* Anticipated time span: indefinite
* Project status: currently running
* Contact address: https://groups.google.com/forum/#!forum/wikiteam-discuss
* Willing to take contributors or not: not willing
* Subject area narrow or broad: broad

== Project information ==

=== Introduction ===
This project was created to provide a dedicated space just for transferring Wikimedia dump files to the [https://archive.org/ Internet Archive]. These dumps were created as a possible backup in the case of cluster-wide hardware failure, and they are also often used by researchers and bots. Sometimes these files are generated to support forking of a Wikimedia project, when many people on a project have aims that differ from the original Wikimedia goal.

More information about the archiving process is available at [[Dumps/Archive.org]].


=== Data currently being archived ===
Here is some information, with links, about the data that this project is archiving:
* Wikimedia main database dumps
* Wikimedia incremental dumps
* Wikidata JSON dumps
* Wikimania videos
* OpenStreetMap datasets

=== Servers ===
* dumps-N (where N is an integer): Main archiving servers
* dumps-stats: Wikimedia data manipulation, including the dumps above and other material relevant to Wikimedia research.

Storage:
* Before the eqiad migration we had a 900 GB quota (hardly sufficient for comfortable work).
* Currently all heavy operations are conducted on /data/scratch/. We keep to a soft limit of 3 TB, but such disk usage is always temporary and the data is deleted once it has been pushed to the Archive.
* Everything is retained locally only for very short periods, just the time needed for packing and uploading to archive.org; a small housekeeping sketch follows this list.
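
A minimal housekeeping sketch of this policy; the staging directory and helper names below are hypothetical illustrations, not the project's actual scripts:

<syntaxhighlight lang="python">
import os
import shutil

# Hypothetical staging area under /data/scratch/; the real layout may differ.
SCRATCH = "/data/scratch/dumps-staging"
SOFT_LIMIT_BYTES = 3 * 1000 ** 4  # the 3 TB soft limit mentioned above


def staged_bytes(path=SCRATCH):
    """Return the total size of files currently staged in the scratch area."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def cleanup_after_upload(item_dir):
    """Delete a staged item once it has been pushed to archive.org."""
    shutil.rmtree(item_dir)


if __name__ == "__main__":
    used = staged_bytes()
    print("scratch usage: %.2f TB of 3 TB soft limit" % (used / 1000 ** 4))
</syntaxhighlight>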

=== Code ===
The source code for all the files used in this project is available on GitHub. This code might (in the future) find its way into the Wikimedia Gerrit repository, but there are no plans to do so right now.

More information is available at [[Dumps/Archive.org]].

==== Caveats ====
* We watch for collateral damage on ganglia where possible.
* Never ''assume'' some data is safe. If we didn't archive something and you can archive it before us, do so! Use archive.org and we'll notice it and fill in any gap.

==== Internet Archive tips ====
* It's fine to upload to the "opensource" collection with the "wikiteam" keyword and let the collection admins among us sort it out later.
* All new archival code should use the IA library: https://pypi.python.org/pypi/internetarchive (see the sketch after this list).
* On duplication: first of all, be thankful for Internet Archive's generosity and efficiency with little funding. Second, SketchCow> [...] uploading stuff of dubious value or duplication to archive.org: [...] gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad.
** For instance, it's probably pointless to archive two copies of the same XML, one compressed with 7z and one with bz2. Just archive the 7z copy; anyone who needs fast consumption with bzcat or similar can rely on the original site.
* As of summer 2014, upload is one or two orders of magnitude faster than it used to be. It's not uncommon to reach 350 Mb/s upstream to s3.us.archive.org.
* Ask more on #wikiteam or #internetarchive at EFNet for informal chat, or on the archive.org forums for discoverability.
* Especially when the files are huge, remember to disable automatic derive: it creates data transfer for no gain.
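
Tying these tips together, here is a minimal upload sketch using the internetarchive library. It assumes your IA S3 keys are already configured (e.g. by running "ia configure"); the item identifier, file name and metadata values are illustrative only, not a real item:

<syntaxhighlight lang="python">
import internetarchive as ia

# Illustrative identifier and file name, not a real item.
ia.upload(
    "idwiktionary-20150115",
    files=["idwiktionary-20150115-pages-meta-history.xml.7z"],
    metadata={
        "collection": "opensource",  # collection admins sort it into wikiteam later
        "subject": "wiki; MediaWiki; Wikimedia projects; dumps",
        "mediatype": "web",          # illustrative choice of mediatype
    },
    queue_derive=False,              # disable automatic derive for huge files
    verbose=True,
)
</syntaxhighlight>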

==== Metadata ====
As for metadata, it's important to keep it correct and consistent, if not rich, so that things are easy to find, bulk download and link.
* For pageview stats we follow this template: https://archive.org/details/wikipedia_visitor_stats_201001
* For incremental dumps something like this: https://archive.org/details/incr-idwiktionary-20150115
* For regular dumps (usually $wikiid-$timestamp as identifier) it's important to be precise (a worked example follows this list):
**<description>Database backup dump provided by the Wikimedia Foundation: a complete copy of the wiki content, in the form of wikitext source and metadata embedded in XML.</description>
**<licenseurl>https://creativecommons.org/licenses/by-sa/3.0/</licenseurl>
**<contributor>Wikimedia Foundation</contributor>
**<subject>wiki; MediaWiki; Wikimedia projects; dumps</subject>
**<rights>https://dumps.wikimedia.org/legal.html</rights>
**Optional:
***<creator>Wikimedia projects editors</creator>
***<subject>data dumps; idwiktionary; Wiktionary</subject> (the database name, like "idwiktionary" or "fiwiki", is easy; the project name is not obvious when the database name ends with "wiki", e.g. commonswiki is Wikimedia Commons).
***<language>id</language> (same caveats as above and [https://meta.wikimedia.org/wiki/Special_language_codes even more])
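
For quick reference, here is how those fields might look as a metadata dict passed to the internetarchive library's upload() call; the language value and the optional fields are illustrative, as for an idwiktionary item:

<syntaxhighlight lang="python">
# Illustrative metadata for a regular dump item, using the fields listed above.
metadata = {
    "description": (
        "Database backup dump provided by the Wikimedia Foundation: a complete "
        "copy of the wiki content, in the form of wikitext source and metadata "
        "embedded in XML."
    ),
    "licenseurl": "https://creativecommons.org/licenses/by-sa/3.0/",
    "contributor": "Wikimedia Foundation",
    "subject": "wiki; MediaWiki; Wikimedia projects; dumps",
    "rights": "https://dumps.wikimedia.org/legal.html",
    # Optional fields:
    "creator": "Wikimedia projects editors",
    "language": "id",  # illustrative; see the caveats above
}
</syntaxhighlight>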


=== Links ===
