Nova Resource:Dumps/Documentation: Difference between revisions
Links
- Wikimedia dumps server: http://dumps.wikimedia.org/
- OpenStreetMap datasets: http://planet.openstreetmap.org/
- Wikimedia Downloads collection on the Internet Archive: http://archive.org/details/wikimediadownloads
- OpenStreetMap data collection: http://archive.org/details/osmdata
Revision as of 00:29, 17 January 2015
Dumps
Description
This is a project that archives the public datasets generated by Wikimedia.
Purpose
Archive the public Wikimedia datasets.
Anticipated time span
indefinite
Project status
currently running
Contact address
https://groups.google.com/forum/#!forum/wikiteam-discuss
Willing to take contributors or not
not willing
Subject area narrow or broad
broad
Project information
Introduction
This project was created to provide a dedicated space for transferring Wikimedia dump files to the Internet Archive. These dumps were created as a possible backup in the case of cluster-wide hardware failure, and they are also often used by researchers and bots. Sometimes these files are used to fork a Wikimedia project, when many of a project's contributors develop aims different from the original Wikimedia goals.
Data currently being archived
Here is some information, with links, regarding the data this project is archiving:
- Wikimedia main database dumps
- Wikimedia incremental dumps
- Wikidata JSON dumps
- Wikimania videos
- OpenStreetMap datasets
Servers
- dumps-N (where N is an integer): Main archiving servers
- dumps-stats: Wikimedia data manipulation, including the dumps above and other data of relevance for Wikimedia research.
Storage:
- Before the eqiad migration we used to have a 900 GB quota (hardly sufficient for comfortable work).
- Currently all heavy operations are conducted on /data/scratch/. We keep to a soft limit of 3 TB of space, but such disk usage is always temporary: the files are deleted once the data is pushed to the Archive.
- Everything is retained locally only for very short periods, just the time needed for packing on archive.org.
Code
The source code for all the files used in this project is available on GitHub. This code might eventually find its way into the Wikimedia Gerrit repository, but there are no plans to do so right now.
More information is available at Dumps/Archive.org.
Caveats
- We watch for collateral damage on ganglia where possible.
- Never assume any data is safe. If we didn't archive something and you can archive it before us, do so! Use archive.org and we'll notice, filling in any gaps.
Internet Archive tips
- It's fine to upload to the "opensource" collection with the "wikiteam" keyword and let the collection admins among us sort it out later.
- Every new archival code should use the IA library: https://pypi.python.org/pypi/internetarchive
- On duplication: first of all, be thankful for the Internet Archive's generosity and efficiency with little funding. Second, as SketchCow put it: "[...] uploading stuff of dubious value or duplication to archive.org: [...] gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad."
- For instance, it's probably pointless to archive two copies of the same XML, one compressed in 7z and one in bz2. Just archive the 7z copy; fast consumption that needs bzcat and the like can rely on the original site.
- As of summer 2014, upload is one or two orders of magnitude faster than it used to be. It's not uncommon to reach 350 Mb/s upstream to s3.us.archive.org.
- Ask more on #wikiteam or #internetarchive at EFNet for informal chat, or on the archive.org forums for discoverability.
- Especially when the files are huge, remember to disable automatic derive: it creates data transfer for no gain.
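
The tips above can be combined into a short script using the `internetarchive` library; the sketch below is a minimal, hedged example, and the identifier and filename shown are hypothetical (use your own item's values). The actual upload also assumes you have configured IA credentials first.

```python
# Sketch: uploading a dump to archive.org with the internetarchive library
# (https://pypi.python.org/pypi/internetarchive), following the tips above.

# Keyword/collection per the tip: upload to "opensource" with the
# "wikiteam" subject and let the collection admins sort it out later.
item_metadata = {
    "collection": "opensource",
    "subject": "wikiteam",
}

def push_to_archive(identifier, files, metadata):
    """Upload files to an archive.org item, skipping automatic derive."""
    from internetarchive import upload
    # queue_derive=False disables the automatic derive process, which on
    # huge dump files only creates data transfer for no gain.
    return upload(identifier, files=files, metadata=metadata,
                  queue_derive=False)

# Example call (hypothetical identifier and filename; requires
# credentials configured with `ia configure`):
# push_to_archive("incr-idwiktionary-20150115",
#                 ["idwiktionary-20150115-pages-meta-hist-incr.xml.bz2"],
#                 item_metadata)
```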
Metadata
As for metadata, it's important to keep it correct and consistent, if not rich, so that things are easy to find, bulk download and link.
- For pageview stats we follow this template: https://archive.org/details/wikipedia_visitor_stats_201001
- For incremental dumps something like this: https://archive.org/details/incr-idwiktionary-20150115
- For regular dumps (usually $wikiid-$timestamp as identifier) it's important to be precise:
- <description>Database backup dump provided by the Wikimedia Foundation: a complete copy of the wiki content, in the form of wikitext source and metadata embedded in XML.</description>
- <licenseurl>https://creativecommons.org/licenses/by-sa/3.0/</licenseurl>
- <contributor>Wikimedia Foundation</contributor>
- <subject>wiki; MediaWiki; Wikimedia projects; dumps</subject>
- <rights>https://dumps.wikimedia.org/legal.html</rights>
- Optional:
- <creator>Wikimedia projects editors</creator>
- <subject>data dumps; idwiktionary; Wiktionary</subject> (a database name like "idwiktionary" or "fiwiki" is easy to derive; the project name is not obvious when the database name ends with "wiki", e.g. commonswiki is Wikimedia Commons).
- <language>id</language> (same caveats as above and even more)
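
The fields above can be assembled programmatically so every regular dump item stays consistent. A minimal sketch, assuming the $wikiid-$timestamp identifier convention described above; the helper name is hypothetical:

```python
# Sketch: building the identifier and metadata for a regular dump item,
# following the $wikiid-$timestamp convention and the fields listed above.

def dump_item(wikiid, timestamp):
    """Return (identifier, metadata) for a regular Wikimedia dump item."""
    identifier = "%s-%s" % (wikiid, timestamp)
    metadata = {
        "description": ("Database backup dump provided by the Wikimedia "
                        "Foundation: a complete copy of the wiki content, "
                        "in the form of wikitext source and metadata "
                        "embedded in XML."),
        "licenseurl": "https://creativecommons.org/licenses/by-sa/3.0/",
        "contributor": "Wikimedia Foundation",
        "subject": "wiki; MediaWiki; Wikimedia projects; dumps",
        "rights": "https://dumps.wikimedia.org/legal.html",
        # Optional field from the list above:
        "creator": "Wikimedia projects editors",
    }
    return identifier, metadata

identifier, metadata = dump_item("idwiktionary", "20150115")
```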