Nova Resource:Dumps/Documentation
=== Introduction ===
This project was created to provide a dedicated space just for transferring Wikimedia dump files to the [https://archive.org/ Internet Archive]. These dumps were created as a possible backup in case of cluster-wide hardware failure, and they are also often used by researchers and bots. Sometimes these files are generated to support forking a Wikimedia project, when many members of a project have aims different from the original Wikimedia goal.
=== Data currently being archived ===

Storage:
* Currently all heavy operations are conducted on /data/scratch/. We keep to a soft limit of 3 TB, but such disk usage is always temporary and the data is deleted once it has been pushed to the Archive.
* Everything is retained locally only for very short periods, just the time needed for packing on archive.org.
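As a minimal illustrative sketch (not the project's actual tooling): a check of how much of the 3 TB soft limit a scratch directory is using. The path /data/scratch/ and the limit come from the text above; the function names are invented for this example.

```python
# Illustrative soft-limit check for a scratch directory; names are made up.
import os

SOFT_LIMIT_BYTES = 3 * 10**12  # the 3 TB soft limit mentioned above

def tree_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

def over_soft_limit(path, limit=SOFT_LIMIT_BYTES):
    """True when the directory tree exceeds the soft limit."""
    return tree_size(path) > limit
```

Something like `over_soft_limit("/data/scratch/")` could then be run before starting a new heavy operation.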
=== Code ===

The source code for all the files used in this project is available on GitHub. This code might eventually find its way into the Wikimedia Gerrit repository, but there are no plans to do so right now.
==== Caveats ====

* We watch for collateral damage on Ganglia where possible.
* Never ''assume'' some data is safe. If we didn't archive something and you can archive it before us, do so! Use archive.org and we'll notice, filling any gap.
==== Internet Archive tips ====

* It's fine to upload to the "opensource" collection with the "wikiteam" keyword and let the collection admins among us sort it out later.
* All new archival code should use the IA library: https://pypi.python.org/pypi/internetarchive
* On duplication: first of all, be thankful for the Internet Archive's generosity and efficiency on little funding. Second, as SketchCow put it: "[...] uploading stuff of dubious value or duplication to archive.org: [...] gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad."
** For instance, it's probably pointless to archive two copies of the same XML, one compressed in 7z and one in bz2. Just archive the 7z copy; fast consumption needing bzcat and the like can rely on the original site.
* As of summer 2014, upload is one or two orders of magnitude faster than it used to be. It's not uncommon to reach 350 Mb/s upstream to s3.us.archive.org.
* Ask more on #wikiteam or #internetarchive at EFNet for informal chat, or on the archive.org forums for discoverability.
* Especially when the files are huge, remember to disable automatic derive: it creates data transfer for no gain.
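Two of the tips above (use the IA library, disable automatic derive) could be combined roughly as follows. This is a hedged sketch, not the project's actual upload code: the filename is invented, the identifier reuses the incremental-dump example from this page, and the real upload only runs when the `DRY_RUN` flag is flipped.

```python
# Illustrative sketch of an upload via the internetarchive library
# (https://pypi.python.org/pypi/internetarchive). Identifier and filename
# below are examples, not real items.
DRY_RUN = True  # set to False to actually contact archive.org

def make_upload_args(files):
    """Keyword arguments for internetarchive.upload(); queue_derive=False
    disables the automatic derive step, which for huge files only creates
    data transfer for no gain (see the tip above)."""
    return {
        "files": files,
        "queue_derive": False,  # disable automatic derive
        "verbose": True,
    }

if not DRY_RUN:
    from internetarchive import upload  # pip install internetarchive
    upload("incr-idwiktionary-20150115",
           **make_upload_args(["idwiktionary-20150115-example.xml.7z"]))
```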
==== Metadata ====

As for metadata, it's important to keep it correct and consistent, if not rich, so that things are easy to find, bulk download and link.

* For pageview stats we follow this template: https://archive.org/details/wikipedia_visitor_stats_201001
* For incremental dumps, something like this: https://archive.org/details/incr-idwiktionary-20150115
* For regular dumps (usually $wikiid-$timestamp as identifier) it's important to be precise:
** <description>Database backup dump provided by the Wikimedia Foundation: a complete copy of the wiki content, in the form of wikitext source and metadata embedded in XML.</description>
** <licenseurl>https://creativecommons.org/licenses/by-sa/3.0/</licenseurl>
** <contributor>Wikimedia Foundation</contributor>
** <subject>wiki; MediaWiki; Wikimedia projects; dumps</subject>
** <rights>https://dumps.wikimedia.org/legal.html</rights>
** Optional:
*** <creator>Wikimedia projects editors</creator>
*** <subject>data dumps; idwiktionary; Wiktionary</subject> (a database name like "idwiktionary" or "fiwiki" is easy; the project name is not obvious when the database name ends with "wiki", e.g. commonswiki is Wikimedia Commons)
*** <language>id</language> (same caveats as above, [https://meta.wikimedia.org/wiki/Special_language_codes and even more])
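The conventions above can be collected in a small helper. This is an illustrative sketch (the function names are invented): it applies the $wikiid-$timestamp identifier rule and fills in the fixed and optional metadata fields listed.

```python
# Illustrative helpers (names invented) assembling the identifier and
# metadata fields listed above for a regular dump item.

def dump_identifier(wikiid, timestamp):
    """Regular dumps usually use $wikiid-$timestamp as identifier."""
    return "%s-%s" % (wikiid, timestamp)

def dump_metadata(wikiid, project_name=None, language=None):
    md = {
        "description": ("Database backup dump provided by the Wikimedia "
                        "Foundation: a complete copy of the wiki content, in "
                        "the form of wikitext source and metadata embedded "
                        "in XML."),
        "licenseurl": "https://creativecommons.org/licenses/by-sa/3.0/",
        "contributor": "Wikimedia Foundation",
        "subject": "wiki; MediaWiki; Wikimedia projects; dumps",
        "rights": "https://dumps.wikimedia.org/legal.html",
        # Optional field:
        "creator": "Wikimedia projects editors",
    }
    if project_name:  # e.g. "Wiktionary" for the idwiktionary database
        md["subject"] = "data dumps; %s; %s" % (wikiid, project_name)
    if language:  # e.g. "id"; mind the special language codes
        md["language"] = language
    return md
```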
=== Links ===
Revision as of 13:01, 12 July 2015
'''Dumps'''
;Description: This is a project that archives the public datasets generated by Wikimedia.
;Purpose: Archive the public Wikimedia datasets.
;Anticipated time span: indefinite
;Project status: currently running
;Contact address: https://groups.google.com/forum/#!forum/wikiteam-discuss
;Willing to take contributors or not: not willing
;Subject area narrow or broad: broad

== Project information ==
More information about the archiving process is available at [[Dumps/Archive.org]].
=== Data currently being archived ===

Here is some information and links regarding the data that this project is archiving:
* Wikimedia main database dumps
* Wikimedia incremental dumps
* Wikidata JSON dumps
* Wikimania videos
* OpenStreetMap datasets
=== Servers ===

* dumps-N (where N is an integer): main archiving servers
* dumps-stats: Wikimedia data manipulation, including the dumps above and other data of relevance for Wikimedia research.
Storage:
* Before the eqiad migration we used to have a 900 GB quota (hardly sufficient for comfortable work).