Commons:Bots/Requests/SchlurcherBot9

SchlurcherBot (talk · contribs) 9 (Update to Request 8)

Operator: Schlurcher (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: Add structured data based on information provided on the file description page, according to Commons:Structured data/Modeling

Automatic or manually assisted: Automatic

Edit type (e.g. Continuous, daily, one time run): One time run at increased speed.

Maximum edit rate (e.g. edits per minute): 50, then 100, then 200 (if feasible); previously 30

Bot flag requested: (Y/N): N (Bot has flag already)

Programming language(s): Bash + QuickStatements, later Pywikibot + Python API calls (operated locally and on en:Microsoft Azure with Commons:IP block exemption)

Schlurcher (talk) 14:22, 6 January 2020 (UTC)[reply]

Discussion

This request is motivated by a discussion at Commons_talk:Structured_data#Structured_copyright_and_licensing_for_search_indexing. Summary:

  • The development team is trying to use structured copyright and licensing information to improve search experience
  • Currently only a few bots add structured data to files, and completing the task will take years
  • I have been asked if it is possible to drastically increase the edit speed
  • Database administrators are okay with slowly ramping up from the operations side whenever the community is ready

This request will use the same code as my request 8. Previous request for reference:

Commons:Bots/Requests/SchlurcherBot8
SchlurcherBot (talk · contribs)

Operator: Schlurcher (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: Add structured data based on information provided on the file description page, according to Commons:Structured data/Modeling

Examples
  • Adding P7482 / Q66458942 (own work by original uploader) to files that were uploaded by the author and declared as own work
Out of scope:
  • Any information that cannot be derived from the Commons file description page (like information linked to the picture ID in Wikidata)
  • Copying descriptions to captions (due to the discussion on Village Pump regarding this)

Automatic or manually assisted: Automatic

Edit type (e.g. Continuous, daily, one time run): Batches based on prepared lists

Maximum edit rate (e.g. edits per minute): 30

Bot flag requested: (Y/N): N (Bot has flag already)

Programming language(s): Bash + QuickStatements, later Pywikibot

Schlurcher (talk) 14:22, 6 January 2020 (UTC)[reply]

Discussion

This request is motivated by a discussion on User_talk:JarektBot#original_creation_by_uploader. The intended task is to add structured data based on information provided on the file description page. The first task envisioned is to add P7482 / Q66458942 (own work by original uploader) to all files that fulfill the following requirements:

  1. File does not have property P7482 / Q66458942 already
  2. File uses template {{Own}} in the source field of the information template
  3. Username of the uploader is equal to, or part of, the username given in the author field of the information template

Initially, I will use a Bash script that queries the Commons API to check all three conditions. Edits will be added to a list that can be processed through QuickStatements batch runs. A first batch run was performed under my username: Bot Test Run. Moving forward, the edits are planned under the bot username (with flag). Further structured data statements are expected to be added. The task is similar to the recent actions of BotMultichill (talk · contribs) and BotMultichillT (talk · contribs). However, Multichill (talk · contribs) works on files that use both {{Own}} and {{Self}} (without checking for condition number 3 above), so somewhat broader coverage is expected with this task. The task is expected to be broadened over time to include structured data derived from the author, source, date and license information. Once structured data on Commons is properly implemented in Pywikibot, the bot might switch to this framework (as used for the other tasks of the bot). --Schlurcher (talk) 14:22, 6 January 2020 (UTC)[reply]
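For illustration, the statement this task adds corresponds to the following Wikibase claim on the file's MediaInfo entity. This is a minimal sketch reconstructed from the property/item pair named above, not the operator's code; the media ID M12345 is a made-up example.

  # Sketch of the P7482 -> Q66458942 claim attached to a file's MediaInfo
  # entity. P7482 is "source of file"; Q66458942 is "original creation by
  # uploader". The equivalent QuickStatements V1 line for the hypothetical
  # media ID M12345 would be: M12345|P7482|Q66458942
  claim = {
      'mainsnak': {
          'snaktype': 'value',
          'property': 'P7482',
          'datavalue': {
              'type': 'wikibase-entityid',
              'value': {'entity-type': 'item', 'id': 'Q66458942'},
          },
      },
      'type': 'statement',
      'rank': 'normal',
  }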

Please make a test run. --EugeneZelenko (talk) 15:45, 6 January 2020 (UTC)[reply]
I have completed the test run on my main account. Please see here: Bot Test Run. I need to get autoconfirmed status with my bot on Wikidata in order to be able to use QuickStatements on Commons. I'll soon complete a test run on the bot account. --Schlurcher (talk) 18:56, 6 January 2020 (UTC)[reply]
Can you share a link to the source code like this? Multichill (talk) 20:11, 6 January 2020 (UTC)[reply]
I do not have a link to the source code. The bash script runs on a single file name and performs the following actions (sequentially, with error checking):
  1. Contact the Commons API to get file description information in XML format; extract uploader, author and source.
  2. Contact the CommonsEntities API to get structured data and the media ID in JSON format; extract the media ID and existing statement information.
  3. Check and add statements for QS:
    1. Check that there is no existing statement.
    2. Check that the conditions described in this request are fulfilled.
    3. Write out the statement to be used in QS.
This way I can repeat the script for each file of concern. The file list was downloaded from: Wikimedia Commons Dumps (all file names). Hope this helps. --Schlurcher (talk) 14:06, 8 January 2020 (UTC)[reply]
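For illustration only, the same sequence could be expressed in Python roughly as follows. This is a sketch, not the operator's script (which is Bash and works with XML): it uses the JSON flavor of the same APIs, and detecting {{Own}} via the int-own-work CSS class in the rendered credit line is an assumption.

  import requests

  API = 'https://commons.wikimedia.org/w/api.php'

  def quickstatements_line(title):
      # Step 1: file description info -- uploader plus rendered author/source.
      r = requests.get(API, params={
          'action': 'query', 'format': 'json', 'titles': title,
          'prop': 'imageinfo', 'iiprop': 'user|extmetadata'}).json()
      page = next(iter(r['query']['pages'].values()))
      info = page['imageinfo'][0]
      uploader = info['user']
      author = info['extmetadata'].get('Artist', {}).get('value', '')
      source = info['extmetadata'].get('Credit', {}).get('value', '')

      # Step 2: structured data -- media ID and existing statements.
      mid = f"M{page['pageid']}"
      e = requests.get(API, params={
          'action': 'wbgetentities', 'format': 'json', 'ids': mid}).json()
      statements = e['entities'][mid].get('statements') or {}

      # Step 3: write out a QuickStatements line only if all conditions hold.
      if 'P7482' in statements:         # condition 1: statement already present
          return None
      if 'int-own-work' not in source:  # condition 2: {{Own}} in the source field
          return None
      if uploader not in author:        # condition 3: uploader is (part of) the author
          return None
      return f'{mid}|P7482|Q66458942'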
In case your bot can do more than 30/min it would be great to increase the rate, since at 30/minute it would take about 2 years to do 27M files. I would also suggest adding creator (P170) as well, since you already know who that user is. There is a page Commons:Structured data/Modeling/Author discussing ways to model the author, but there is more activity on the talk page. I proposed a new property author's wikimedia username to make adding author information easier, but the proposal was rejected. Multichill (talk · contribs) favors a different scheme, which we should probably adopt. However, we should probably make sure the discussion on that page reaches some sort of consensus and that Commons:Structured data/Modeling/Author is updated. When I was adding some P7482s, the discussion was still ongoing. --Jarekt (talk) 20:34, 6 January 2020 (UTC)[reply]
@Jarekt: Thanks for highlighting this. I would prefer that the discussion reaches some sort of consensus before we start rolling this out. So, for now, I prioritized a date implementation, which I mapped to inception (P571). --Schlurcher (talk) 15:11, 8 January 2020 (UTC)[reply]
Commons:Structured data/Modeling/Date is a bit clearer; the only potential issue is that inception (P571) and Wikidata do not handle dates with hours, minutes and seconds (HH:MM:SS), as the highest precision is day. The string encoding a date looks like "+2020-01-08T00:00:00Z", and one could add HH:MM:SS to it, but since the string always ends with "Z" it suggests that the time is in the UTC timezone (the time in London), while on Commons HH:MM:SS is assumed to be in an unspecified local timezone. So for the time being I would only work with dates that do not specify the time of day. See Help:Dates. --Jarekt (talk) 16:59, 8 January 2020 (UTC)[reply]
Thanks. I have implemented dates with a precision of day, month and year. Dates specified down to the hour will be skipped. I have also added Commons:Structured data/Modeling to the task description for clarity. --Schlurcher (talk) 19:27, 8 January 2020 (UTC)[reply]
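A minimal sketch of that precision handling, assuming ISO-style input strings (the bot's actual date parsing is more involved). Wikibase time values carry a precision code, where 9 = year, 10 = month and 11 = day, and day is the maximum:

  import re

  GREGORIAN = 'http://www.wikidata.org/entity/Q1985727'

  def wikibase_time(date_str):
      # Map a plain date string to a Wikibase time value; skip (return
      # None) anything carrying a time of day rather than truncating it.
      if re.fullmatch(r'\d{4}', date_str):                # e.g. "2020"
          time, precision = f'+{date_str}-00-00T00:00:00Z', 9
      elif re.fullmatch(r'\d{4}-\d{2}', date_str):        # e.g. "2020-01"
          time, precision = f'+{date_str}-00T00:00:00Z', 10
      elif re.fullmatch(r'\d{4}-\d{2}-\d{2}', date_str):  # e.g. "2020-01-08"
          time, precision = f'+{date_str}T00:00:00Z', 11
      else:
          return None  # HH:MM:SS present or unparsable: skip this file
      return {'time': time, 'timezone': 0, 'before': 0, 'after': 0,
              'precision': precision, 'calendarmodel': GREGORIAN}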

@EugeneZelenko: I have performed a test run on the bot account. Results are here: [1] --Schlurcher (talk) 19:21, 8 January 2020 (UTC)[reply]

Looks OK to me. A pity that the time from EXIF could not be used. --EugeneZelenko (talk) 15:01, 9 January 2020 (UTC)[reply]

If there are no objections, I think the task should be approved. --EugeneZelenko (talk) 16:17, 15 January 2020 (UTC)[reply]

@EugeneZelenko: I compared my results with the results from BotMultichillT (talk · contribs) as well as with his code. As far as I can see, we will add the same information, so there should be no conflict between the bots. We use slightly different page generators, so there should be some synergy. My bot will not add author information, as this information is currently not displayed correctly on the structured data tab on Commons: the qualifiers are not displayed, which makes it difficult to read. See File:Coucouron - église 01.JPG as an example; the author information in structured data is blank even though BotMultichillT added it in the backend. So I leave this to BotMultichillT until it is fixed for Commons. Generally, I do agree and think that my bot is ready. --Schlurcher (talk) 08:35, 17 January 2020 (UTC)[reply]

The change is that I plan to run additional instances of the same script in parallel, as well as to make use of cloud resources (en:Microsoft Azure). Updates are marked in bold above. A staggered approach as follows is envisioned:

  1. 50 edits per minute (achievable with my personal infrastructure)
  2. 100 edits per minute (personal infrastructure + cloud infrastructure)
  3. 200 edits per minute (personal infrastructure + additional cloud infrastructure)
  4. An additional clone of this bot (not covered by this request; it would likely need to be run by a separate volunteer)

Use of cloud resources is required starting from stage 2. The use of cloud resources requires Commons:IP block exemption, which has been granted by Taivo per the standard process. Stage 3 would go beyond the maximum bot activity recorded on Commons and would likely require closer monitoring by the infrastructure team. Stage 4 will be part of a separate discussion and request. This request is to get community support for increasing the maximum edit rate compared to my previous approved request. --Schlurcher (talk) 21:03, 6 August 2020 (UTC)[reply]
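As an illustration of the parallelization (a sketch under assumptions; the file names and chunk count are made up, and the actual work split is the operator's choice), the dump-derived file list can simply be dealt round-robin into one chunk per script instance, so the local and Azure instances never touch the same file:

  def split_worklist(path, n_chunks):
      # Deal the file list round-robin into n_chunks disjoint work lists,
      # one per bot instance (e.g. local Raspberry Pi + Azure VMs).
      with open(path) as f:
          titles = [line.strip() for line in f if line.strip()]
      for i in range(n_chunks):
          with open(f'chunk_{i:02d}.txt', 'w') as out:
              out.writelines(t + '\n' for t in titles[i::n_chunks])

  split_worklist('commonswiki-all-titles.txt', 4)  # names are illustrative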

@Keegan (WMF), Bjh21, Jarekt, Tacsipacsi, Multichill, Mmullie (WMF), Gestumblindi, and Taivo: You have been involved in the discussions that led to this proposal. Please share your thoughts. --Schlurcher (talk) 21:03, 6 August 2020 (UTC)[reply]
  • If this is just about being allowed to edit at higher speed, then I   Support it. I agree that it is a good idea to start slow and monitor when going to the next step. --MGA73 (talk) 21:54, 6 August 2020 (UTC)[reply]
Thanks, MGA73. Note that the staggered increase in edit rate is not only to monitor system performance, but also to monitor the community's reaction. While all edits are marked as bot edits, as far as I know it is not possible to also mark this kind of edit as a minor edit, so people who show bot edits on their watchlists have more difficulty filtering these edits out. There is a risk that users get flooded with these edits. --Schlurcher (talk) 11:38, 7 August 2020 (UTC)[reply]
@Schlurcher: Thanks for the info. I have 52k pages on my watchlist so I see your bot all the time :-) But I prefer flooding over having a nice watchlist. But if we want we can probably code our commons.js so it ignores changes made by your bot. --MGA73 (talk) 11:45, 7 August 2020 (UTC)[reply]
I found this User_talk:INaturalistReviewBot#Watchlist but I have not tested it. --MGA73 (talk) 11:47, 7 August 2020 (UTC)[reply]
  •   Support This is something that we asked Schlurcher to do, and we are grateful that he agreed to look into it. As a reference, 100 edits/min means ~1M edits/week, and Commons has 63M files. I am not sure how many files meet Schlurcher's bot conditions, but at stage 2 it would take about a year of continuous running. Also as a reference, I often use the widely used d:Help:QuickStatements (QS) tool for the much simpler task of adding single SDC statements. With QS you do not have control over speed, since the tool, running on Toolforge, paces itself based on server load. Its speed can vary a lot, but in some cases I clocked it at over 180 edits per minute. However, that pace is usually short-lived, as QS works with individually loaded batches of images, and any batch bigger than 25K times out during job loading. --Jarekt (talk) 04:10, 7 August 2020 (UTC)[reply]
  • I don't get it. If you modify the Pywikibot settings a bit, one stupid old laptop can do 100s of edits per minute. I understand you don't want to leave your laptop on 24x7. Why use Azure and not Toolforge? The whole replag thing is related to Wikidata and the fact that the query service can't keep up. We don't have that problem here, as our local query service is only updated about once a week. Multichill (talk) 07:46, 7 August 2020 (UTC)[reply]
Hi Multichill, there are a couple of reasons. 1) I'm not using a laptop; I am using a dedicated Raspberry Pi to run this operation, mainly due to power consumption considerations. 2) Azure offers a free 12-month trial for two vCPUs that I can use to run this 24x7. 3) I have zero experience with Toolforge and limited motivation to learn it. On the other hand, I have been interested in learning Azure for a long time, and when the discussion started, I thought this was finally a good opportunity to do so. 4) My motivation to run this bot is also to learn system administration, system monitoring and process management, and to improve my Linux/cloud skills. I like to have full control over the environment. --Schlurcher (talk) 08:19, 7 August 2020 (UTC)[reply]
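For reference, the Pywikibot settings Multichill alludes to are presumably the write-throttle values in user-config.py; a hedged sketch follows (the exact numbers are assumptions, not taken from this discussion):

  # user-config.py (Pywikibot) -- illustrative throttle settings only.
  family = 'commons'
  mylang = 'commons'
  usernames['commons']['commons'] = 'SchlurcherBot'  # example account

  put_throttle = 0.6  # seconds between write actions (~100 edits/min)
  maxlag = 5          # back off while server replication lag exceeds 5 s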

Info Based on the support seen so far, I have increased the edit rate to approximately 50 edits per minute. Please let me know if there are any concerns. --Schlurcher (talk) 18:38, 13 August 2020 (UTC)[reply]


Approved. --Krd 05:55, 14 August 2020 (UTC)[reply]