
Provide a mechanism for detecting duplicate files in enwiki and another Wikipedia
Open, Medium, Public

Description

Wikipedias allow uploading non-free content under fair use. Smaller wikis often copy a fair-use photo from enwiki and upload it locally to be used in a similar article. On fawiki, we even have a bot (operated by @Yamaha5) that does this automatically and updates the infoboxes in fawiki articles based on the image found in the enwiki version.

There are two issues here:

(a) Sometimes the enwiki file has a change of license: for example, a file originally uploaded under fair use is later deemed to be in the public domain after a threshold-of-originality assessment (this happens commonly with logos). We would want to know when the two analogous files are no longer both in a fair-use category, so that the duplicate copy can be edited accordingly.

(b) Sometimes one of the two files is nominated for deletion, or actually deleted, and this may need to be repeated for the duplicate copy as well.

Note that this doesn't need a real-time replica; one option might be to create a separate data store that contains a daily or weekly dump of the image table for all wikis. From there, one can write a script that finds the analogous files (by their name and/or SHA-1 hash) and connects to the Wiki Replicas to fetch more data on them (e.g. which categories they are in). A rough sketch of such a script follows.
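As a minimal illustration of the matching step, the sketch below queries the Wiki Replicas directly rather than the proposed separate data store. It assumes a Toolforge environment (credentials in `~/replica.my.cnf`, replica hosts named `<wiki>.analytics.db.svc.wikimedia.cloud`) and matches by SHA-1 only; matching by name would follow the same pattern. Table and column names are the standard MediaWiki `image`, `page`, and `categorylinks` schema as exposed by the replica views.

```python
#!/usr/bin/env python3
# Sketch: find fawiki files whose SHA-1 matches an enwiki file, then
# report the enwiki copy's categories so license/deletion divergence
# can be spotted. Not production code; host/db naming and credential
# file location are Toolforge conventions and may need adjusting.

import pymysql


def connect(wiki):
    """Open a connection to one wiki's replica database."""
    return pymysql.connect(
        host=f"{wiki}.analytics.db.svc.wikimedia.cloud",
        database=f"{wiki}_p",
        read_default_file="~/replica.my.cnf",  # Toolforge credentials
        charset="utf8mb4",
    )


def local_files(conn):
    """Return {sha1: file name} for every locally uploaded file."""
    with conn.cursor() as cur:
        cur.execute("SELECT img_sha1, img_name FROM image")
        return {sha1.decode(): name.decode() for sha1, name in cur.fetchall()}


def matching_files(conn, sha1s, batch=500):
    """Yield (sha1, name) pairs on this wiki whose SHA-1 is in sha1s."""
    sha1s = list(sha1s)
    with conn.cursor() as cur:
        for i in range(0, len(sha1s), batch):
            chunk = sha1s[i:i + batch]
            placeholders = ",".join(["%s"] * len(chunk))
            cur.execute(
                "SELECT img_sha1, img_name FROM image"
                f" WHERE img_sha1 IN ({placeholders})",
                chunk,
            )
            for sha1, name in cur.fetchall():
                yield sha1.decode(), name.decode()


def categories(conn, file_name):
    """Return the categories a File: page (namespace 6) belongs to."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT cl_to FROM categorylinks"
            " JOIN page ON cl_from = page_id"
            " WHERE page_namespace = 6 AND page_title = %s",
            (file_name,),
        )
        return [row[0].decode() for row in cur.fetchall()]


if __name__ == "__main__":
    fawiki, enwiki = connect("fawiki"), connect("enwiki")
    local = local_files(fawiki)
    for sha1, en_name in matching_files(enwiki, local):
        print(f"fa:{local[sha1]} == en:{en_name}:",
              categories(enwiki, en_name))
```

From the printed pairs, the same loop could flag cases where only one copy sits in a fair-use category, or where a previously matched enwiki file has disappeared since the last run.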

Event Timeline

Andrew triaged this task as Medium priority. Dec 8 2020, 5:28 PM