Nova Resource:Dumps/Archive.org

From Wikitech
Archive.org refers to the Internet Archive, a digital library consisting mainly of scanned books, but it can hold almost anything that is freely licensed.

We are currently working on moving the public datasets to the Archive for preservation, although right now it's mainly being handled by volunteers (specifically Hydriz and Nemo).


== Archiving ==
There is a project on Wikimedia Labs called "[[Nova_Resource:Dumps|Dumps]]", run by volunteers, that is dedicated to running the archiving processes. Currently, the datasets being archived are:
# [[Dumps/Adds-changes dumps|Adds/Changes dumps]] ([https://github.com/Hydriz/incrdumps source]) - Runs automatically via crontab
# Main database dumps - Runs automatically via an archiving daemon
# Wikimedia visitor project statistics (hourly versions, grouped by month) - Manually run
# Other available Wikimedia datasets


=== Code ===
The source code for all the files used in this project is available on GitHub. This code might (in the future) find its way into the Wikimedia Gerrit repository, but there are no plans to do so right now.

The archiving of all datasets is managed by an archiving daemon under the project "Balchivist". It regularly scans for new dumps and feeds them into a database; an "archive runner" then picks them up and archives each dump at a later stage. [https://github.com/Hydriz/Balchivist Code is available here].
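As an illustration only (this is not the actual Balchivist code; the SQLite schema and the <code>discover_new_dumps()</code> / <code>archive_item()</code> callables are hypothetical), the scan-then-run split described above looks roughly like this:

<syntaxhighlight lang="python">
import sqlite3


def init_queue(path="archive-queue.db"):
    # Hypothetical queue database; the real Balchivist schema may differ.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue (identifier TEXT PRIMARY KEY, status TEXT)"
    )
    return conn


def scan(conn, discover_new_dumps):
    # Daemon side: feed newly discovered dumps into the database as "waiting".
    for identifier in discover_new_dumps():
        conn.execute(
            "INSERT OR IGNORE INTO queue (identifier, status) VALUES (?, 'waiting')",
            (identifier,),
        )
    conn.commit()


def run_archiver(conn, archive_item):
    # "Archive runner" side: pick up waiting dumps and archive them later.
    rows = conn.execute(
        "SELECT identifier FROM queue WHERE status = 'waiting'"
    ).fetchall()
    for (identifier,) in rows:
        archive_item(identifier)  # e.g. upload the dump to archive.org
        conn.execute(
            "UPDATE queue SET status = 'archived' WHERE identifier = ?",
            (identifier,),
        )
        conn.commit()
</syntaxhighlight>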

== Metadata ==
As for metadata, it's important to keep it correct and consistent, if not rich, so that items are easy to find, bulk-download and link. A small identifier-building sketch follows the list below.
* For pageview stats we follow this template: https://archive.org/details/wikipedia_visitor_stats_201001
* For incremental dumps something like this: https://archive.org/details/incr-idwiktionary-20150115
* For regular dumps (usually $wikiid-$timestamp as identifier) it's important to be precise; follow this template: https://archive.org/details/enwiki-20150304
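For illustration, a small helper sketch (not an official project script) for producing identifiers that match the templates above, so new items stay consistent:

<syntaxhighlight lang="python">
def pageview_stats_identifier(year, month):
    # e.g. wikipedia_visitor_stats_201001 (hourly stats grouped by month)
    return "wikipedia_visitor_stats_%04d%02d" % (year, month)


def incr_dump_identifier(wikiid, date):
    # e.g. incr-idwiktionary-20150115 (date as YYYYMMDD)
    return "incr-%s-%s" % (wikiid, date)


def main_dump_identifier(wikiid, date):
    # e.g. enwiki-20150304 ($wikiid-$timestamp)
    return "%s-%s" % (wikiid, date)


assert pageview_stats_identifier(2010, 1) == "wikipedia_visitor_stats_201001"
assert incr_dump_identifier("idwiktionary", "20150115") == "incr-idwiktionary-20150115"
assert main_dump_identifier("enwiki", "20150304") == "enwiki-20150304"
</syntaxhighlight>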

=== Caveats ===
* We watch for collateral damage on Ganglia where possible.
* Never ''assume'' some data is safe. If we didn't archive something and you can archive it before us, do so! Use archive.org and we'll notice and fill any gaps. Please [[m:Talk:WikiTeam|ping us]] to have your item moved into the collection.

=== Internet Archive tips ===
* It's fine to upload to the "opensource" collection with the "wikiteam" keyword and let the collection admins among us sort it later.
* All new archival code should use the IA library: https://pypi.python.org/pypi/internetarchive (a short example follows this list)
* On duplication: first of all, be thankful for the Internet Archive's generosity and efficiency on little funding. Second, as SketchCow put it: "[...] uploading stuff of dubious value or duplication to archive.org: [...] gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad."
** For instance, it's probably pointless to archive two copies of the same XML, one compressed in 7z and one in bz2. Just archive the 7z copy; anyone needing fast consumption with bzcat or similar can rely on the original site.
* As of summer 2014, upload is one or two orders of magnitude faster than it used to be. It's not uncommon to reach 350 Mb/s upstream to s3.us.archive.org.
* Ask more on #wikiteam or #internetarchive at EFNet for informal chat, or on the archive.org forums for discoverability.
* Especially when the files are huge, remember to disable automatic derive: it creates data transfer for no gain.
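Putting a few of these tips together, a hedged example with the internetarchive library: the item identifier, filename and metadata values below are placeholders, and credentials are assumed to have been set up with <code>ia configure</code>.

<syntaxhighlight lang="python">
from internetarchive import upload

upload(
    "examplewiki-20150701",  # placeholder item identifier
    files=["examplewiki-20150701-history.xml.7z"],  # placeholder filename
    metadata={
        "collection": "opensource",  # collection admins sort it later
        "subject": "wikiteam",
        "mediatype": "web",
        "title": "Example wiki dump (2015-07-01)",
    },
    queue_derive=False,  # disable automatic derive, especially for huge files
    verbose=True,
)
</syntaxhighlight>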


== Development ==
=== Expansion of scope ===
* Archive the main database dumps: <font color="green">Done</font>
* Archive the full media tarballs: <font color="orange">In progress</font>
** A blueprint is currently being drafted on the best method to archive the media on ''all Wikimedia wikis'' (including Commons) while using minimal resources.


=== Robustness ===
* Improve overall usability of items on Archive.org: <font color="green">Done</font>
* Better error handling/skipping errors: <font color="green">Done</font>


=== Speed ===
* Implement parallelization: <font color="green">Done</font>
* Tap into multipart uploading for S3: <font color="orange">In progress</font>
** Multipart uploads are slower overall compared to direct uploads. Work is also being done to ensure that multipart uploads can easily resume; a rough sketch follows this list.
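For reference, a rough sketch of what a multipart upload against the S3-like endpoint could look like, assuming s3.us.archive.org accepts the standard S3 multipart calls issued by boto3. This is not the code used by this project; the identifier, file path and credentials are placeholders.

<syntaxhighlight lang="python">
import os

import boto3


def multipart_upload(path, identifier, access_key, secret_key,
                     part_size=100 * 1024 * 1024):
    # Point a standard S3 client at the Internet Archive's S3-like endpoint.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.us.archive.org",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    key = os.path.basename(path)
    mpu = s3.create_multipart_upload(Bucket=identifier, Key=key)
    # To resume a previous attempt, keep mpu["UploadId"] around and use
    # s3.list_parts() to skip part numbers that were already accepted.
    parts = []
    with open(path, "rb") as f:
        number = 1
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            resp = s3.upload_part(Bucket=identifier, Key=key,
                                  UploadId=mpu["UploadId"],
                                  PartNumber=number, Body=chunk)
            parts.append({"PartNumber": number, "ETag": resp["ETag"]})
            number += 1
    s3.complete_multipart_upload(Bucket=identifier, Key=key,
                                 UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})
</syntaxhighlight>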


[[Category:Dumps]]
