Wikipedia:Link rot/URL change requests
This page is for requesting modifications to URLs, such as marking them dead or changing them to a new domain. Some bots are designed to fix link rot and can be notified here; these include InternetArchiveBot and WaybackMedic. This page can be monitored by bot operators from other-language wikis, since URL changes are universally applicable.
Bot might convert links to httpS?
There are several thousand "http" links on WP to many different pages of my site (whose homepage is http://penelope.uchicago.edu/Thayer/E/home.html) which really should be httpS. The site is secure with valid certificates, etc. Is this something a bot can take care of quickly?
24.136.4.218 (talk) 19:20, 11 February 2021 (UTC)
In general, I think a bot could replace http with https for all webpages, after some checks. (The guidelines prefer https over http; see WP:External links#Specifying_protocols.) My naive idea is to create a bot that goes through http links and checks if they are also valid with https. If they are, then the bot can replace the http link with the https link. Apart from the question of whether there is a general problem with the idea, a few questions remain:
  1. Should the bot replace all links or only major ones (official webpage, info boxes, ...)?
  2. Should the bot only check if https works or if http and https provide the same page?
I would be happy to hear what others think about the idea. Nuretok (talk) 11:43, 5 April 2021 (UTC)
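A minimal sketch of such a check, assuming the Python requests library and that comparing the final status code after redirects is a good-enough test of "also valid with https" (a real bot would need extra soft-404 and content checks, and some servers reject HEAD):

import requests

def https_equivalent_ok(http_url, timeout=15):
    # swap the scheme and see whether the https variant answers like the http one
    assert http_url.startswith("http://")
    https_url = "https://" + http_url[len("http://"):]
    try:
        old = requests.head(http_url, allow_redirects=True, timeout=timeout)
        new = requests.head(https_url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return False
    return new.ok and new.status_code == old.status_code

# the homepage mentioned in the request above
print(https_equivalent_ok("http://penelope.uchicago.edu/Thayer/E/home.html"))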
One argument against this is that many websites implement an http -> https redirect. Thus if one accesses the link with http, it will be redirected to https. In this case, it would not matter which protocol the link uses on WP; the user would always end up on https. Even the cited example above is redirected. -- Srihari Thalla (talk) 19:09, 8 April 2021 (UTC)
You are right, many websites forward http to https, but this still allows a man-in-the-middle attack when someone prevents this redirect. This is one of the reasons the Wikipedia guidelines recommend using https and browser plugins such as HTTPS Everywhere exist. Of course everyone is free to use https everywhere, but providing good defaults (https in this case) is usually considered good practice. By the way, instead of checking each site individually, there is a list of servers that support https which the bot could check to see if it wants to move from http to https. Nuretok (talk) 08:20, 17 April 2021 (UTC)
observer.com
I found many broken links to www.observer.com: some (but not all) of these links no longer lead to the articles that were originally cited. Jarble (talk) 21:04, 13 February 2021 (UTC)
Since this is a mix of live and dead probably better to leave it for IABot which should be able to detect the dead. -- GreenC 03:19, 14 February 2021 (UTC)
@GreenC: IABot won't detect them. I tried running IABot on this page, but the link is still incorrect. Jarble (talk) 21:35, 11 March 2021 (UTC)
IABot won't work. It's pretty complex. First impression is anything "https" is OK. Anything "http" without a hostname is also OK. That narrows it down to about a thousand possible trouble URLs. Of these, some work and some don't. Some are also redirecting to spam links needing |url-status=unfit. There are patterns, but also exceptions. I might need to make a dry run, log what it does, build rules to take into account the mistakes, then make a live run. Hard to say up front what the rules should be. Will take some time to figure out, there are a lot of variables. -- GreenC 01:45, 12 March 2021 (UTC)
Results
The rest were already archived or still working or now tagged with {{dead link}}. Once the soft404 redirects were identified it was not too difficult. If you see any problems let me know. @Jarble: -- GreenC 21:39, 13 March 2021 (UTC)
sfsite.com/~silverag
My website, formerly located at www.sfsite.com/~silverag, has moved to www.stevenhsilver.com. It is used as a citation on numerous Wikipedia pages. If a bot could go through and replace the string sfsite.com/~silverag with stevenhsilver.com it would correct the broken links. Shsilver (talk) 12:57, 14 February 2021 (UTC)
Hi, the bot switched 108 URLs. There are 13 left the bot could not determine. -- GreenC 17:54, 14 February 2021 (UTC)
Thanks. Some of those switched, others pointed to pages I decided not to upload to the new site. I appreciate your work and your bot's. Shsilver (talk) 19:19, 14 February 2021 (UTC)
Illinois Historic Preservation Agency
Hello, the Illinois Historic Preservation Agency recently took down their website because it was based on Adobe Flash, breaking lots of links to documentation. I just checked a random one, and it was in the Internet Archive, so I assume that the link-changing bots could archive a good number of them. Could someone have a bot collect all the URLs of a form http://gis.hpa.state.il.us/pdfs/XXXXXX.pdf and run all of them through IA? "X" represents a numeral; some of these files may have five or fewer numerals (XXXX.pdf), or may have seven or more (XXXXXXXX.pdf), so please don't assume that they're all six digits.
Thanks! Nyttend (talk) 19:27, 16 February 2021 (UTC)
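A sketch of a pattern that would collect these URLs from wikitext, assuming a plain regular expression is enough; the digit run is left open-ended rather than fixed at six, per the request (the sample wikitext below is made up for illustration):

import re

# any run of digits before .pdf, not only six
IHPA_PDF = re.compile(r"https?://gis\.hpa\.state\.il\.us/pdfs/\d+\.pdf", re.IGNORECASE)

sample_wikitext = ("<ref>http://gis.hpa.state.il.us/pdfs/200341.pdf</ref> "
                   "<ref>http://gis.hpa.state.il.us/pdfs/4021.pdf</ref>")
print(IHPA_PDF.findall(sample_wikitext))   # both URLs, regardless of digit count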
Hi Nyttend, results are in: 1,151 articles edited, 1,035 archive URLs and 217 {{dead link}}s added. Let me know if you see any problems. PDFs are the easiest as they either clearly work or not. -- GreenC 01:35, 17 February 2021 (UTC)
Thank you, GreenC. If you click any IHPA link (even my XXXXXX sample), you're taken to a page that says "A new version of HARGIS will be available in the coming weeks." (This was the case before I made this request; I asked because there's no guarantee that the new site will use the same linking structure for its PDFs.) Do you have a way of looking up where the 217 dead links are located? When I notice that they've put up the new version of the site, I may come back and ask for help getting the links working again, but only if you have a way of going through the ones that your bot handled, without un-archiving the 1035. Nyttend (talk) 12:19, 17 February 2021 (UTC)
In that case the 217 + 1035 might be live again (there are logs). Ping me when ready and I will take a look. The bot can unwind archives, replace dead links with live ones, move URL schemes, retrieve new URLs from redirects, etc. -- GreenC 15:39, 17 February 2021 (UTC)
whitehouse.gov
A lot of whitehouse.gov links have died after the domain recently "changed owner". A rare occasion where many Wikipedians may be glad for sources dying. There is an archive at https://trumpwhitehouse.archives.gov. Example of old broken and new working url:
There is a slim chance/risk that some of the broken links will work again in about four years. Some whitehouse.gov links are working and should not be changed. Can a bot sort it out? PrimeHunter (talk) 13:09, 25 February 2021 (UTC)
Some older source links are archived at https://obamawhitehouse.archives.gov or https://georgewbush-whitehouse.archives.gov​.
Obama example of broken and working link:
Bush example of broken and working link:
Some links work via redirects:
https://www.whitehouse.gov/the-press-office/2013/06/24/daily-briefing-press-secretary-jay-carney-6242013
redirects to
https://obamawhitehouse.archives.gov/the-press-office/2013/06/24/daily-briefing-press-secretary-jay-carney-6242013
https://www.archives.gov/presidential-libraries/archived-websites also mentions Clinton archives. The newest is https://clintonwhitehouse5.archives.gov/ from January 2001. I don't know whether we have broken links it could fix.
A bot could test every whitehouse.gov link to see whether it works now or at any of the archives. PrimeHunter (talk) 14:02, 25 February 2021 (UTC)
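A minimal sketch of that test, assuming requests and treating a redirect back to a bare homepage path as a failure (a crude soft-404 guard; the real bot logic is more involved):

import requests
from urllib.parse import urlsplit

# current site first, then the National Archives frozen copies mentioned above
BASES = ["https://www.whitehouse.gov",
         "https://trumpwhitehouse.archives.gov",
         "https://obamawhitehouse.archives.gov",
         "https://georgewbush-whitehouse.archives.gov"]

def find_working(url, timeout=20):
    path = urlsplit(url).path
    for base in BASES:
        try:
            r = requests.head(base + path, allow_redirects=True, timeout=timeout)
        except requests.RequestException:
            continue
        # ignore candidates that just bounce to a homepage
        if r.ok and urlsplit(r.url).path not in ("", "/"):
            return r.url
    return None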
OK, based on your research, I agree it's worth exploring to see how well it works. Will take a look. -- GreenC 14:25, 25 February 2021 (UTC)
Results: modified 8,263 URLs in 5,060 articles. Changed metadata info such as |work=whitehouse.gov. Plus other general fixes by WaybackMedic. Matter of curiosity: 67% were found by the scanning method described above and the rest had working redirects in the header. Most of the working redirects were Obama; Trump had a high proportion of 404s and no redirects, perhaps poorly maintained and/or too soon after leaving office. Also some pages (10%?) can't be archived by any web archive service; they just don't work, there is something in the page that prevents web archiving by third parties, but regardless they still work at the National Archives. @PrimeHunter: -- GreenC 16:46, 3 March 2021 (UTC)
@GreenC: Great! Thanks a lot. Do you have a list of broken links which couldn't be fixed? I noticed one in [1]: https://www.whitehouse.gov/the-press-office/2013/05/20/president-obama-announces-sally-ride-recipient-presidential-medal-freedom​. It redirects but the target doesn't work. Thanks for checking the redirect didn't help. It turned out to be our own fault. The real link [2] didn't have a final m which was added by a careless editor in [3], so there is no general fix we can learn from that. PrimeHunter (talk) 22:30, 3 March 2021 (UTC)
There were 30: Wikipedia:Link rot/cases/whitehouse.gov -- GreenC 22:55, 3 March 2021 (UTC)
@GreenC: Thanks. That's a nice low number. I have fixed many of them with guessing or Googling without finding a system. Some were clearly our own fault with url's that never would have worked. Should I remove the fixed ones from Wikipedia:Link rot/cases/whitehouse.gov​? PrimeHunter (talk) 02:21, 4 March 2021 (UTC)
Yes, about 0.5% of the whitehouse URLs is explainable by local data entry or remote site errors; it's probably better than one might expect. It's a good idea to check for, and great you were able to fix some. Use the page any way you like, markup or delete entries. -- GreenC 03:12, 4 March 2021 (UTC)
StarWars.com
Anything with http://www.starwars.com should be changed to https. Thanks. JediMasterMacaroni (Talk) 18:20, 25 February 2021 (UTC)
Forwarded to User_talk:Bender235#StarWars.com -- GreenC 19:04, 25 February 2021 (UTC)
Will do. --bender235 (talk) 19:33, 25 February 2021 (UTC)
Thanks. JediMasterMacaroni (Talk) 19:34, 25 February 2021 (UTC)
Replace atimes.com links
Please replace all instances of atimes.com and its subdomains with asiatimes.com. The old website is replaced by an advertising site. ~ Ase1este​c​harge-parity​t​ime 10:11, 28 February 2021 (UTC)
Also, if the corresponding page with the new domain is not found, not archived, and there is an archive with the old domain, then do not replace the URL, but add the archive link and mark the URL status as unfit. Thanks. ~ Ase1este​c​harge-parity​t​ime 10:26, 28 February 2021 (UTC)
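A sketch of the requested rule, assuming a status-code check is a good-enough test for "page found" on the new domain; the "not archived on the new domain" condition and the actual archive lookup for the unfit case are left out here, and the real bot also has to handle soft 404s:

import re
import requests

def atimes_disposition(old_url, timeout=20):
    # map atimes.com and any of its subdomains onto asiatimes.com, same path
    new_url = re.sub(r"//([a-z0-9.-]+\.)?atimes\.com", "//asiatimes.com", old_url)
    try:
        found = requests.head(new_url, allow_redirects=True, timeout=timeout).ok
    except requests.RequestException:
        found = False
    if found:
        return "replace", new_url      # point the cite at asiatimes.com
    # otherwise: keep the old |url=, add an archive of it and |url-status=unfit
    return "archive+unfit", old_url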
Ok. It might take a couple passes: first to move the domain where possible, and second to add the archives+unfit for the remainder. Still working on the whitehouse.gov request above; could be a few days at least. -- GreenC 15:46, 28 February 2021 (UTC)
Ok, thanks, I can wait. ~ Ase1este​c​harge-parity​t​ime 17:42, 28 February 2021 (UTC)
Results:
@Aseleste: I think that is all, if you see anything else let me know. -- GreenC 04:23, 6 March 2021 (UTC)
Looks good, thanks! ~ Ase1este​c​harge-parity​t​ime 04:28, 6 March 2021 (UTC)
www.geek.com
I found many broken links on this domain: is it possible to fix them automatically? Jarble (talk) 21:30, 11 March 2021 (UTC)
This is the same situation as observer.com -- in the IABot database the domain is set to Whitelisted, thus the bot is not checking/fixing dead links. My bot can try; it's a lot easier than observer as the numbers are small and it only requires checking for 404s. -- GreenC 01:51, 12 March 2021 (UTC)
unc.edu
Thread copied from WP:BOTREQ#Replace_dead_links
Please could someone replace ELs of the form
https://www.unc.edu/~rowlett/lighthouse/bhs.htm (a dead link)
with
{{Cite rowlett|bhs}}
which produces
Rowlett, Russ. "Lighthouses of the Bahamas". The Lighthouse Directory. University of North Carolina at Chapel Hill.
Thanks — Martin (MSGJ · talk) 05:38, 19 March 2021 (UTC)
What sort of scale of edits are we talking (tens, hundreds, thousands)? Primefac (talk) 14:37, 19 March 2021 (UTC)
Special:LinkSearch says 1054 for "​https://www.unc.edu/~rowlett/lighthouse​" and 483 for the "http://" variant. DMacks (talk) 14:43, 19 March 2021 (UTC)
But spot-checking, it's a mix of {{cite web}}, plain links, and links with piped text, and with/without additional plain bibliographic notes. For example, 165 of the https:// form are in a "url=..." context. I think there are too many variations to do automatically. DMacks (talk) 15:06, 19 March 2021 (UTC)
MSGJ, the only type that can be converted is {{cite web}}; as noted by User:DMacks it's too messy to determine the square and bare links due to the free-form text that might surround the URL, unless there is some discernible pattern. There are 334 articles that contain a preceding "url=". Couple questions:
-- GreenC 19:19, 19 March 2021 (UTC)
Thanks for looking into this GreenC. I asked at Template talk:Cite rowlett and the working ibiblio.org links almost exactly correspond to the old unc.edu/~rowlett links. I'm not sure what to do with archive links. Keep them if they are working? The use of {{Cite rowlett}} would be preferable, where possible, but if not, then the bare links can just be replaced. Thanks — Martin (MSGJ · talk) 21:49, 22 March 2021 (UTC)
nytimes.com links to All Movie Guide content
Links to https://www.nytimes.com/movies/person/* are dead and reporting as a soft 404, thus not picked up by archive bots. There are about 1300 articles with links in https and about 150 in http. The URLs are to The New York Times, but the content is licensed to All Movie Guide, thus if in a CS1|2 citation it would convert to |work=All Movie Guide and |via=The New York Times. In addition, an archive URL is added if available, otherwise the link is marked dead. Extra credit: it could try to determine the date and author by scraping the archive page. Example. -- GreenC 18:00, 6 April 2021 (UTC)
Results
-- GreenC 00:25, 15 April 2021 (UTC)
articles.timesofindia.indiatimes.com links to timesofindia.indiatimes.com
Several years ago all the content on this subdomain was moved to timesofindia.indiatimes.com. However, the links are not the same, there are no redirects, and the new URLs cannot be reconstructed or guessed using any algorithm. One has to search Google for the title of the link on the former domain and update the link with the new domain.
LinkSearch
Old URL - http://articles.timesofindia.indiatimes.com/2001-06-28/pune/27238747_1_lagaan-gadar-ticket-sales (archived)
New URL - https://timesofindia.indiatimes.com/city/pune/film-hungry-fans-lap-up-gadar-lagaan-fare/articleshow/1796672357.cms
Is there a possibility for a WP:SEMIAUTOMATED bot that takes inputs from the user about the new URL and updates WP? Is there an existing bot? If not, I created a small semi-automated script (here) to assist me with the same functionality. Do I need to get approval for this bot, if this is even considered a bot? -- Srihari Thalla (talk) 19:20, 8 April 2021 (UTC)
Are you seeing problems with content drift (content at the new page is different from the old)? You'll need to handle existing |archive-url=, |archive-date= and |url-status=, since you can't change |url= and not |archive-url=, which if changed has to be verified working. There is {{webarchive}}, which sometimes follows bare and square links and might need to be removed or changed. The |url-status= should be updated from dead to live. There are {{dead link}}s that might need to be added or removed. You should verify the new URL is working, not assume it does; and if there are redirects in the headers, capture those and change the URL to reflect them. Those are the basics for this kind of work; it is not easy. Keep in mind there are 3 basic types of cites: those within a cite template, those in a square link, and those bare. Of those three types, the square and bare may have a trailing {{webarchive}}. All types may have a trailing {{dead link}}.
OR, my bot is done and can do all this. All that would be needed is a map of old and new URLs. There are as many as 20,000 URLs; do you propose manually searching for each one? Perhaps better to leave them unchanged and add archive URLs. Those that have no archive URL (ie. {{dead link}}), manually search for those to start. I could generate a list of those URLs with {{dead link}} while making sure everything else is archived. -- GreenC 20:24, 8 April 2021 (UTC)
If you already have the bot ready, then we can start with those that have no archive URL. If you could generate the list, I could also post on WP:INDIA asking for volunteers.
I would suggest doing this work using a semi-automated script, ie. the script would read the page with the list, parse each row and print it on the terminal (all details of the link possible: full cite, link title, etc.) so that it would be easy for the user to search; once the new URL is found, the script takes the input and saves it to the page. Do you think this would be faster and more convenient?
I would also suggest forming the list using the columns: serial number, link, cite/bare/square link, title (if possible), new url, new url status, new archive url, new archive url date. The last "new" ones would be blank, to be filled in once researched. Do these columns look good?
Do you have a link to your bot? -- DaxServer (talk) 07:45, 9 April 2021 (UTC)
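A minimal sketch of that terminal workflow, assuming a tab-separated worklist of old URL + title has already been generated (the file names and columns here are hypothetical; saving back to the wiki would go through the API or a later bot pass rather than this mapping file):

import csv

def review_worklist(path="toi_dead_links.tsv", out="toi_url_mapping.tsv"):
    with open(path, newline="", encoding="utf-8") as src, \
         open(out, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter="\t")
        for old_url, title in csv.reader(src, delimiter="\t"):
            # show the editor everything needed to search for the new URL
            print(f"\nTitle:   {title}\nOld URL: {old_url}")
            new_url = input("New URL (blank to skip): ").strip()
            if new_url:
                writer.writerow([old_url, new_url])

review_worklist()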
How about I provide you with as much data as possible in a regular, parsable format? I'd prefer not to create the final table, as that should be done by the author of the semi-automated script based on its requirements and location. Does that sound OK? The bot page is User:GreenC/WaybackMedic_2.5, however it is 3 years out of date, as is the GitHub repo; there have been many changes since 2018. The main bot is nearly 20k lines, but each URLREQ move request has its own custom module that is smaller. I can post an example skeleton module if you are interested; it is in Nim (programming language), which is similar to Python syntax. -- GreenC 18:24, 9 April 2021 (UTC)
The data in a parsable format is a good one to start with. Based on this, a suitable workflow can be established over time. The final table can be done later, as you said.
I unfortunately had never heard of Nim. I know a little bit of Python and could have looked at Nim, but I do not have any time until mid-May. Would this be an example of a module, citeaddl? But this is Medic 2.1 and not 2.5. Perhaps you could share the example. If it looks like something I can deal with without much of a learning curve, I would be able to work out something. If not, I would have to wait until the end of May and then evaluate again! -- DaxServer (talk) 20:24, 9 April 2021 (UTC)
User:GreenC/software/urlchanger-skeleton-easy.nim is a generic skeleton source file, to give a sense of what is involved. It only needs modifying some variables at the top defining the old and new domains. There is a "hard" skeleton for more custom needs, where mods are done throughout the file, when the easy version is not enough. The file is part of the main bot, isolating domain-specific changes to this file. I'll start on the above; it will probably take a few days, depending how many URLs are found. -- GreenC 01:42, 11 April 2021 (UTC)
@DaxServer: The bot finished. Cites with {{dead link}} are recorded at Wikipedia:Link rot/cases/Times of India (raw), about 150. -- GreenC 20:57, 16 April 2021 (UTC)
Good to hear! Thanks @GreenC -- DaxServer (talk) 11:16, 17 April 2021 (UTC)
Results
odiseos.net is now a gambling website
There were two references to this website. I have removed one. The archived url has the content. Should this citation be preserved or removed?
My edit and existing citation -- DaxServer (talk) 07:50, 9 April 2021 (UTC)
This is a usurped domain. Normally they would be changed to |url-status=usurped. The talk page instance is removed because the "External links modified" section can be removed; it is an old system no longer used. I'll need to update the InternetArchiveBot database to indicate this domain should be blacklisted, but the service is currently down for maintenance. https://iabot.toolforge.org/ -- GreenC 17:10, 9 April 2021 (UTC)
I have also reverted my edit to include the |url-status=usurped (new edit). -- DaxServer (talk) 20:33, 9 April 2021 (UTC)
Migrate old URLs of "thehindu.com"
Old URLs from sometime before 2010 have a different URL structure. The content has moved to a new URL but a direct redirect is not available. The old URL is redirected to a list page which is categorized by the date the article was published. One has to search for the title of the article and follow the link. Surprisingly, some archived URLs I tested were redirected to the new archived URL. My guess is that the redirection worked in the past, but was broken at some point.
Old URL - http://hindu.com/2001/09/06/stories/0406201n.htm (archived in 2020 - automatically redirected to the new archived url; old archive from 2013)
Redirected to list page - https://www.thehindu.com/archive/print/2001/09/06/
Title - IT giant bowled over by Naidu
New URL from the list page - https://www.thehindu.com/todays-paper/tp-miscellaneous/tp-others/it-giant-bowled-over-by-naidu/article27975551.ece
There is no content drift between the old URL (2013 archive) and the new URL.
Example from N. Chandrababu Naidu - PS. This citation is used twice (as searched by the title), once with the old URL and once with the new URL. -- DaxServer (talk) 14:18, 9 April 2021 (UTC)
The new URL [4] is behind a paywall and unreadable, while the archive of the old URL [5] is fully readable. I think it would be preferable to maintain archives of the old URLs since they are not paywalled and there would be no content drift concern. Perhaps, similar to the above, attempt to migrate when there is a soft 404 that redirects to a list page and no archive is available. -- GreenC 17:37, 9 April 2021 (UTC)
In that case, perhaps the WaybackMedic or the IA bot can add archived urls to all these links? If you want to be more specific, here is the regex of the URLs that I have found so far. There can be others which I have not encountered yet.
https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?\d{4}\/[01]\d\/[0-3][0-9]\/stories\/[0-9a-z]+\.htm
-- DaxServer (talk) 20:39, 9 April 2021 (UTC)
Can you verify the regex because I don't think it would match on the above "Old URL" in the segment \d{4}\/[01]\d\/[0-3][0-9]\/ .. maybe it is a different URL variation? -- GreenC 21:52, 9 April 2021 (UTC)
It matches. I checked it on regex101 and also on the Python CLI. Maybe this is a simpler regex:
https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?\d{4}\/\d{2}\/\d{2}\/stories\/[0-9a-z]+\.htm -- DaxServer (talk) 12:02, 10 April 2021 (UTC)
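For what it's worth, testing the originally posted pattern against the "Old URL" example in Python suggests it does match; 2001/09/06 satisfies the \d{4}\/[01]\d\/[0-3][0-9] segment:

import re

pattern = re.compile(
    r"https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?"
    r"\d{4}\/[01]\d\/[0-3][0-9]\/stories\/[0-9a-z]+\.htm")

old_url = "http://hindu.com/2001/09/06/stories/0406201n.htm"
print(bool(pattern.match(old_url)))   # True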
Ahh got it sorry misread thanks. -- GreenC 13:33, 10 April 2021 (UTC)
Regex modified to work with Elasticsearch insource: and some additional matches: 12,229 results.
insource:/\/{2}(www[.])?(the)?hindu[.]com\/(thehindu\/)?((cp|edu|fr|lf|lr|mag|mp|ms|op|pp|seta|yw)\/)?[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/stories\/[^.]+[.]html?/
-- GreenC 04:27, 17 April 2021 (UTC)
DaxServer, the Hindu is done. Dead link list: Wikipedia:Link rot/cases/The Hindu (raw). -- GreenC 13:24, 23 April 2021 (UTC)
Great work @GreenC !! -- DaxServer (talk) 16:58, 23 April 2021 (UTC)
sify.com
Any link that redirects to the home page. Example. Example. -- GreenC 14:27, 17 April 2021 (UTC)
Results
ancient.eu
Ancient History Encyclopedia has rebranded to World History Encyclopedia and moved domain to worldhistory.org. There are many references to the site across Wikipedia. All references pointing to ancient.eu should instead point to worldhistory.org. Otherwise the URL structure is the same (ie. https://www.ancient.eu/Rome/ is now https://www.worldhistory.org/Rome/​). — Preceding unsigned comment added by Thamis (talkcontribs)
Hi @Thamis:, thanks for the lead/info, this is certainly possible to do. Do you think there is a reason to consider content drift, ie. the page at the new site is different from the original (in substance), or is it largely a 1:1 copy of the core content? Comparing this page with this page it looks like this is an administrative change and not a content change. -- GreenC 23:40, 20 April 2021 (UTC)
Thanks for looking into this, @GreenC:. There's no content drift, it's a 1:1 copy of the content with the exact same URLs (just the domain is different). When I compare the two Rome pages from the archive and the new domain that you linked, I see the exact same page. The same is true for any other page you might want to check. :-)
@Thamis:, this url works but this url does not. The etc.ancient.eu sub-domain did not transfer, but still works at the old site. For these the bot will skip, as the link still works and I don't want to add an archive URL to live links if they will be transferred in the future to worldhistory.org. Can be revisited later. -- GreenC 16:03, 23 April 2021 (UTC)
@GreenC: Indeed, that etc.ancient.eu subdomain was not transferred. It's the www.ancient.eu domain that turned into www.worldhistory.org -- subdomains other than "www" should be ignored.
@Thamis: it is done. In addition to the URLs it also changed/added |work=, etc., to World History Encyclopedia. It got about 90%, but the string "Ancient History Encyclopedia" still exists in 89 pages/cites; they will require manual work to convert (the URLs are converted, only the string is not). They are mostly free-form cites with unusual formatting and would benefit from manual cleanup, probably ideally conversion to {{cite encyclopedia}}. -- GreenC 01:07, 24 April 2021 (UTC)
Results
@GreenC: Thanks a lot for sorting this out! Greatly appreciated. :-) — Preceding unsigned comment added by Thamis (talkcontribs)
You are welcome. If you are looking for more ideas on how to improve: converting everything to a cite template will make future maintenance easier and less error prone. However, I would not recommend creating a custom template, as they are prone to breakage since they require special custom code for tools to work, vs. standard cite templates which are better supported by tools. -- GreenC 18:01, 6 May 2021 (UTC)
*.in.com
Everything is dead. Some redirect to a new domain homepage unrelated to the previous site. Some have 2-level-deep sub-domains. All are now set to "Blacklisted" in IABot for global wiki use; a Medic pass through on enwiki will also help. -- GreenC 04:13, 25 April 2021 (UTC)
Results
Remove oxfordjournals.org
Hello, I think all links to oxfordjournals.org subdomains in the url parameter of {{cite journal}} should be removed, as long as there's at least a doi, pmid, pmc, or hdl parameter set. Those links are all broken, because they redirect to an HTTPS version which uses a certificate valid only for silverchair.com (example: http://jah.oxfordjournals.org/content/99/1/24.full.pdf ).
The DOI redirects to the real target URL, which nowadays is somewhere in academic.oup.com, so there's no point in keeping or adding archived URLs or url-status parameters. These URLs have been broken for years already, so it's likely they will never be fixed. Nemo 07:13, 25 April 2021 (UTC)
About 15,000. I have been admonished for removing archive URLs because of content drift, ie. the page at the time of citation contains different content than the current one (academic.oup.com), therefore the archive URL is useful for showing the page at the time of citation for verification purposes. OTOH if there is reason to believe content drift is not a concern for a particular domain, that is not my call to make; someone else would need to do that research and determine if this should be of concern. @Nemo bis: -- GreenC 16:03, 25 April 2021 (UTC)
The "version of record" is the same, so the PDF at the new website should be identical to the old one. The PubMed Central copy is generally provided by the publisher, too. So the DOI and PMC ID, if present, eliminate any risk of content drift. On the other hand, I'm pretty sure whoever added those URLs didn't mean to cite a TLS error page. :) Nemo 18:21, 25 April 2021 (UTC)
I can do this, just will need some time thanks. -- GreenC
@Nemo bis: edited 20 articles: 1 2 3 4 5 - I forgot to remove |access-date= in a few cases. Do you see any other problems? -- GreenC 00:50, 6 May 2021 (UTC)
Looks good at first glance. I don't remember if Citoid or Citation bot are able to extract the DOI from the HTML in later stages once they can fetch the HTML from wayback machine, but either way it's good to have it. Nemo 06:01, 6 May 2021 (UTC)
@GreenC and Nemo bis: Just saw an edit about this, but the links seem to work fine now? Thanks. Mike Peel (talk) 18:39, 7 May 2021 (UTC)
What example link is working for you? -- GreenC 18:47, 7 May 2021 (UTC)
@GreenC: I tried the example link above, and the one I reverted at [6] (I assume you got a notification about that?). They both redirect fine. Thanks. Mike Peel (talk) 18:54, 7 May 2021 (UTC)
I don't know what is happening. The message Nemo and I got was:
Firefox does not trust this site because it uses a certificate that is not valid for jah.oxfordjournals.org. The certificate is only valid for the following names: *.silverchair.com, silverchair.com, gsw.contentapi.silverchair.com, dup.contentapi.silverchair.com - Error code: SSL_ERROR_BAD_CERT_DOMAIN
This is Firefox 88.01 on Windows 7 - when tried with Chrome it works after going through a captcha of the type "click all squares with a bus" 4 or 5 times, then it goes through to the content. Nemo, are you also using Firefox on Windows? -- GreenC 20:25, 7 May 2021 (UTC)
@GreenC: I'm using Firefox on a Mac. Please could you stop the edits until we can figure out what's going on? Thanks. Mike Peel (talk) 20:27, 7 May 2021 (UTC)
Done. -- GreenC 20:28, 7 May 2021 (UTC)
Mike, doesn't the http://jah.oxfordjournals.org/content/99/1/24.full.pdf URL redirect to https://jah.oxfordjournals.org/content/99/1/24.full.pdf and give a TLS error to you? Nemo 20:32, 7 May 2021 (UTC)
I get a PDF. The download link does start with watermark.silverchair.com though. Thanks. Mike Peel (talk) 20:35, 7 May 2021 (UTC)
Have you tried with another browser? Are you sure you haven't allowed that domain to bypass TLS security? Nemo 20:40, 7 May 2021 (UTC)
See my comment at the end of this section. I might have added an exception, since I use journal articles a lot, but I think that should only affect one browser. Have you tried doing that? Thanks. Mike Peel (talk) 20:42, 7 May 2021 (UTC)
(edit conflict) I guess that's a rare case of a domain that wasn't broken, but all the usual subdomains for journals are broken. That edit was fine anyway because the new link (to doi.org) goes to the same place and is more stable. We don't know how long the legacy OUP domains will even exist at all. Nemo 20:30, 7 May 2021 (UTC)
@Nemo bis: The first pass is done, with some problems. There are cases of non-{{cite journal}} templates that contain DOIs etc. Example. The bot was programmed for journal + aliases only. And I missed {{vcite journal}}. [7] There are cases of {{doi}} it's not set up to detect [8]. There were 1,750 archive URLs added, so these problems would be in that group, though most of them are fine. -- GreenC 18:45, 7 May 2021 (UTC)
Nice bot cooperation! When the URL is removed, doi-access=free can do its job properly. Direct links to PDFs on Wayback are nice; links to archive.today, which only serve me a captcha, I'm not sure about. I see we still have 4000 articles with oxfordjournals.org; we can probably reduce that.
The {{doi}} cases we can't do much about, need to wait for citation bot or others to transform them into structured citations. Same for the non-standard templates: sometimes people who use them are very opinionated.
I think the easiest win now is to replace some of the most commonly used citations, which are often the result of mass-creation of articles about species a decade ago. For instance a replacement similar to this would help some 300 articles:
[http://mollus.oxfordjournals.org/content/77/3/273.full Bouchet P., Kantor Yu.I., Sysoev A. & Puillandre N. (2011) A new operational classification of the Conoidea. Journal of Molluscan Studies 77: 273–308.] → {{cite journal|first1=P.|last1=Bouchet|first2=Y. I.|last2=Kantor|first3=A.|last3=Sysoev|first4=N.|last4=Puillandre|title=A new operational classification of the Conoidea (Gastropoda)|journal=Journal of Molluscan Studies|date=1 August 2011|pages=273–308|volume=77|issue=3|doi=10.1093/mollus/eyr017|url=https://archimer.ifremer.fr/doc/00144/25544/23686.pdf}}
(You can probably match any text between two ref tags or between a bullet and a newline which matches /content/77/3/273.) I just converted the DOI to a {{cite journal}} syntax with VisualEditor/Citoid and added what OAbot would have done. There are a few cases like this, you probably can find them from the IAbot database or from a query of the remaining links. These are the most common IDs among them:
$ curl -s https://quarry.wmflabs.org/run/550945/output/1/tsv | grep -Eo "(/[0-9]+){3}" | sort | uniq -c | sort -nr | head
  69 /21/7/1361
  60 /22/10/1964
  51 /77/3/273
  29 /24/6/1300
  28 /19/7/1008
  25 /24/20/2339
  21 /55/6/912
  17 /22/2/189
  16 /19/1/2
  15 /11/3/257
Nemo 20:30, 7 May 2021 (UTC)
Testing [9] in Safari, Chrome, and Firefox on a Mac, no problems, it redirects fine... Thanks. Mike Peel (talk) 20:32, 7 May 2021 (UTC)
We can definitely replicate the problem on two computers (my own and Nemo's), so there is a good chance it is happening for others. There is also the question of which is better: with the URL or without? With the URL (assuming you get through) it asks for a captcha which is somewhat difficult to get past, and it's a link to a site that is vendor specific. Without the URL it goes to doi.org - long-term reliable - and opens the PDF without a captcha or potential SSL problems. Comparing before and after deletion, the citation has been improved IMO. -- GreenC 20:47, 7 May 2021 (UTC)
Do you have HTTPS Everywhere? I see that http://mollus.oxfordjournals.org/content/77/3/273.full redirects directly to https://academic.oup.com/mollus/article/77/3/273/1211552 without it, but if the redirect is to https://mollus.oxfordjournals.org/content/77/3/273.full then nothing works, because this URL is served incorrectly.
Anyway, this was just one of many issues with those old oxfordjournals.org URLs: there are also pmid URLs which don't go anywhere, URLs which redirect to the mainpage of the respective journal and so on. When we have a DOI there's no reason to keep them, they're ticking bombs even if they just happen to work for now. Nemo 20:53, 7 May 2021 (UTC)
I do have HTTPS Everywhere; turning it off got through (to a captcha). That should not happen. It would be an improvement to replace with DOI URLs when available. -- GreenC 21:14, 7 May 2021 (UTC)
Ok, I've sent a patch for the ruleset. Nevertheless I recommend to proceed with the cleanup because we're never going to be able to babysit the fate of 390 legacy domains. I'm listing at User:Nemo bis/Sandbox some suggestion for more specific replacements. (Some URLs need to be searched in all their variants, especially that first one "77/3/273".) Nemo 21:38, 7 May 2021 (UTC)
(edit conflict) GreenC, would you kindly stop your bot from doing this? You are removing working links for no reason. Here you removed http://jhered.oxfordjournals.org/content/30/12/549.extract​, but that link works just fine, and redirects effortlessly to https://academic.oup.com/jhered/article-abstract/30/12/549/911170​. If it isn't broken there's (really!) no need to mend it (and even less to break it). If you want to replace the old link with the new one that's fine with me (I've already done a few), but please stop removing working links. Thanks, Justlettersandnumbers (talk) 21:44, 7 May 2021 (UTC)
Well that URL is broken for a few million users at the moment, so there is a reason to remove it. One alternative is to replace it with a doi.org URL if there is no doi-access=free or PMC parameter yet. Nemo 21:57, 7 May 2021 (UTC)
Have you tried contacting the journal about these issues? Since the links *do* work (possibly unless you apply extra restrictions), I don't think these removals should be happening without asking the wider community first. Thanks. Mike Peel (talk) 08:28, 8 May 2021 (UTC)
OUP is notoriously impervious to pleas that they fix URLs or even DOIs. There's no point trying. Nemo 09:55, 8 May 2021 (UTC)
Fix pdfs.semanticscholar.org links
The pdfs.semanticscholar.org URLs which HTTP 301 redirect to www.semanticscholar.org are actually dead links. There are quite a few now. A link to the Wayback Machine is possible, but I believe InternetArchiveBot would not normally add it. Nemo 21:15, 28 April 2021 (UTC)
They are soft 404s in the sense that the landing page is 200 and serves related content, but not what is expected from the original URL (ie. a PDF). We can restore the PDF via the Wayback Machine and other archive providers as archive URLs. Being 404-ish links, they should be saved as originally intended for WP:V purposes. If the citation already has an archive link it will be skipped. If no archive link can be found it will leave the URL in place and let Citation bot handle it - I can generate a list of these, there probably will not be many. -- GreenC 21:29, 28 April 2021 (UTC)
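A minimal sketch of detecting these soft 404s, assuming a link should count as dead when it redirects to www.semanticscholar.org and no longer serves a PDF content type:

import requests

def is_soft404(pdf_url, timeout=20):
    try:
        # stream=True so the PDF body is not actually downloaded
        r = requests.get(pdf_url, allow_redirects=True, timeout=timeout, stream=True)
    except requests.RequestException:
        return True                     # hard failure, treat as dead as well
    landed_on_www = "www.semanticscholar.org" in r.url
    is_pdf = r.headers.get("Content-Type", "").startswith("application/pdf")
    return landed_on_www and not is_pdf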
Makes sense, thank you! Nemo 06:42, 29 April 2021 (UTC)
Nemo, testing is going well and about ready for the full run. There are a number of edge-case types found that required special handling, so it's a good thing this is custom. Question: do you know, with this diff, whether Citation bot would then keep the archive URL or remove it? -- GreenC 16:51, 29 April 2021 (UTC)
Those diffs look good. As far as I know, at the moment Citation bot is not removing those URLs; I've tested on a few articles after your bot's edits and they were left alone. Nemo 04:38, 30 April 2021 (UTC)
Nemo, looks done, let me know if you see any problems. -- GreenC 16:43, 30 April 2021 (UTC)
Thank you! Wikipedia:Link rot/cases/pdfs.semanticscholar.org is super useful. I noticed that OAbot can find more URLs to add when a DOI is available and the URL parameter is cleared. So I think I'll do another pass with OAbot by telling it to ignore the SemanticScholar URLs, and then I'll manually remove the redundant ones. Nemo 20:48, 1 May 2021 (UTC)
Actually, I'll track that at phabricator:T281631 for better visibility. Nemo 21:51, 1 May 2021 (UTC)
Results
TracesOfWar citations update
Wikipedia currently contains citations and source references to the websites TracesOfWar.com and .nl (EN-NL bilingual), but also to the former websites ww2awards.com, go2war2.nl and oorlogsmusea.nl. However, these websites have been integrated into TracesOfWar in recent years, so the source reference is now incorrect in hundreds of pages, and a multiple of that in terms of the number of source references. Fortunately, ww2awards and go2war2 currently still redirect to the correct page on TracesOfWar, but this is no longer the case for oorlogsmusea.nl. I have been able to correct all the sources for oorlogsmusea.nl manually. For ww2awards and go2war2 the redirects will stop in the short term, which will result in thousands of dead links, while they can be properly directed towards the same source. A short example: person Llewellyn Chilson (at TracesOfWar persons id 35010) now has a source reference to http://en.ww2awards.com/person/35010, but this must be https://www.tracesofwar.com/persons/35010/. In short, old format to new format in terms of URL, but same ID.
In my opinion, that should make it possible to convert everything with the format 'http://en.ww2awards.com/person/[id]' (old English) or 'http://nl.ww2awards.com/person/[id]' (old Dutch) to 'https://www.tracesofwar.com/persons/[id]' (new English) or 'https://www.tracesofwar.nl/persons/[id]' (new Dutch) respectively. The same applies to go2war2.nl, but with a slightly different format: http://www.go2war2.nl/artikel/[id] becomes https://www.tracesofwar.nl/articles/[id]. The same has already been done on the Dutch Wikipedia, via a similar bot request. Lennard87 (talk) 18:50, 29 April 2021 (UTC)
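A sketch of that URL mapping, assuming the IDs carry over unchanged, are plain numbers, and only the path formats listed above occur:

import re

# old ww2awards / go2war2 formats -> new TracesOfWar formats, same id
RULES = [
    (re.compile(r"https?://en\.ww2awards\.com/person/(\d+)"),
     r"https://www.tracesofwar.com/persons/\1/"),
    (re.compile(r"https?://nl\.ww2awards\.com/person/(\d+)"),
     r"https://www.tracesofwar.nl/persons/\1/"),
    (re.compile(r"https?://www\.go2war2\.nl/artikel/(\d+)"),
     r"https://www.tracesofwar.nl/articles/\1"),
]

def migrate(url):
    for pattern, replacement in RULES:
        new_url, n = pattern.subn(replacement, url)
        if n:
            return new_url
    return url    # leave anything unrecognised alone

print(migrate("http://en.ww2awards.com/person/35010"))
# https://www.tracesofwar.com/persons/35010/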
@Lennard87:, seeing around 500 mainspace URLs on enwiki for all domains combined. Can you verify not missing any? -- GreenC 22:18, 1 May 2021 (UTC)
@GreenC:, that is very well possible yes, but I have no exact numbers. In any case, those roughly 350 (go2war2+ww2awards) should be changed then to tracesofwar.com or .nl.
@Lennard87: results for ww2awards: it moved 251 URLs. Five examples show different types of problems: [10][11][12][13][14] .. the variations on "WW2 Awards" and their location in the cite are difficult. (BTW instead of /person/ some have /award/, which at the new site is /awards/. Example) -- GreenC 18:43, 2 May 2021 (UTC)
Results for go2war2 are similar; it moved 48 URLs: [15][16] -- GreenC 19:26, 2 May 2021 (UTC)
@GreenC:, thanks. Saw the situations, which are difficult, but the proposed changes are correct. Also yes, I forgot about the /award/ change; that can be applied too please. Only the last one with Gunther Josten is a difficult one, as the picture id has changed as well: https://www.mystiwot.nl/myst/upload/persons/9546061207115933p.jpg. There is no relation between the two, so best to leave 'images-person' alone or use the web archive trick.
Reuters
The new Reuters website redirected all subdomains to www.reuters.com and broke all links. That's about 50k articles on the English Wikipedia alone, I believe. I see that the domain is whitelisted on InternetArchiveBot, not sure whether that's intended. Nemo 20:13, 1 May 2021 (UTC)
Wow, that's major. Domains can become auto-whitelisted if the bot is receiving confusing messages by way of user reverts (of the bot). Looks like some subdomains still work [17]. Or correctly return 404 and would be picked up by IABot - except for the whitelist [18]. Or soft-404'ing [19]. How to determine a soft 404 is an art; in this case it is easy enough, as it redirects to a page with the title "Homepage", but there are probably other unknown landing locations. WaybackMedic should be able to do this; it has good code for following redirects, checking headers and verifying (known) soft 404s. Will not be able to start for at least a week, to catch up on other things. Then it will take a while due to the size. -- GreenC 21:59, 1 May 2021 (UTC)
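A minimal sketch of the "Homepage" title heuristic described above, assuming that particular landing-page title; a real run would need additional landing-page patterns plus the header/redirect checks mentioned:

import re
import requests

def reuters_looks_dead(url, timeout=20):
    try:
        r = requests.get(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return True
    if r.status_code == 404:
        return True                                  # a plain hard 404
    m = re.search(r"<title[^>]*>(.*?)</title>", r.text, re.I | re.S)
    title = m.group(1).strip() if m else ""
    return title.lower().startswith("homepage")      # the soft-404 landing page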
Thanks. I count 249299 links to reuters.com across ~all wikis at the moment (phabricator:P15671). Nemo 08:06, 2 May 2021 (UTC)
Interesting how spread out they are, except enwiki. There is probably a rule similar to 80/40 but more like 40/60 or 33/66 (enwiki/everything else)-- GreenC 15:13, 2 May 2021 (UTC)
Dead links redundant with permanent links
Related to #Fix pdfs.semanticscholar.org links, or rather the work that followed it at phabricator:T281631, there are a few hundred {{dead link}} notices which can be removed (together with the associated URL) because the DOI or HDL can be expected to provide the canonical permanent link. See a simple search at:
This is not nearly as urgent as the OUP issue above, and if it's complicated I may also do it manually, but it seems big enough to benefit from a bot run at some point. Nemo 16:26, 5 May 2021 (UTC)
To confirm: if a cite template contains |doi-access=free or |hdl-access=free and has a {{dead link}} attached, remove the {{dead link}} (plus {{cbignore}}) and the |url=. -- GreenC 20:11, 5 May 2021 (UTC)
Yes. Also a pmc. Nemo 20:58, 5 May 2021 (UTC)
|pmid= ? -- GreenC 18:03, 6 May 2021 (UTC)
IMHO not, because the PMID alone doesn't provide the full text, so the original URL might have had something different. The reason PMID is sufficient with the OUP links above is that PubMed links the same publisher landing page as the original URL. Nemo 05:54, 7 May 2021 (UTC)
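A sketch of that rule using mwparserfromhell, assuming the {{dead link}} / {{cbignore}} sit inside the same <ref> as the qualifying cite, and following the thread's conclusion that doi-access=free, hdl-access=free or a pmc qualify while a bare pmid does not:

import mwparserfromhell

FREE = ("doi-access", "hdl-access")

def strip_redundant_dead_links(wikitext):
    code = mwparserfromhell.parse(wikitext)
    for ref in code.filter_tags(matches=lambda n: str(n.tag).strip().lower() == "ref"):
        body = ref.contents
        cites = [t for t in body.filter_templates() if t.name.matches("cite journal")]
        qualifies = any(
            t.has("pmc") or
            any(t.has(p) and str(t.get(p).value).strip() == "free" for p in FREE)
            for t in cites)
        if not qualifies:
            continue
        for t in cites:
            if t.has("url"):
                t.remove("url")
            if t.has("access-date"):        # meaningless once |url= is gone
                t.remove("access-date")
        for t in body.filter_templates():
            if t.name.matches("dead link") or t.name.matches("cbignore"):
                body.remove(t)
    return str(code)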
SR/Olympics templates
Hello. As SR/Olympics has been shut down, several SR/Olympics templates are broken. They are Template:SR/Olympics country at games (250 usages), Template:SR/Olympics sport at games and Template:SR/Olympics sport at games/url (both 63 usages). See for example Algeria at the 2012 Summer Olympics and Football at the 2012 Summer Olympics. I'm not sure if InternetArchiveBot can work with these templates. I was wondering how these links could be fixed with archived URLs like at Template:Sports reference. Thanks! --MrLinkinPark333 (talk) 19:35, 10 May 2021 (UTC)
The first two already have an |archive= argument, so it's just a matter of updating each instance with a 14-digit timestamp, eg. |archive=20161204010101. The last one is used by the second one, which is why it has the same count; nothing to do there. For the first two, I guess it would require some custom code to find a working timestamp and add it. This is why I dislike custom templates: they don't work with standard tools, and each instance is a custom programming job. I'll see what I can do. -- GreenC 20:04, 10 May 2021 (UTC)
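For finding a working timestamp, the Wayback Machine availability API can be queried per URL; a minimal sketch (the example URL below is hypothetical, and the returned value is in the 14-digit form the |archive= parameter expects):

import requests

def closest_snapshot_timestamp(url, around="20161204010101", timeout=30):
    # ask the Wayback Machine for the snapshot closest to the given timestamp
    r = requests.get("https://archive.org/wayback/available",
                     params={"url": url, "timestamp": around}, timeout=timeout)
    snap = r.json().get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["timestamp"]      # 14 digits, usable as |archive=
    return None

# hypothetical example of an SR/Olympics-style URL to look up
print(closest_snapshot_timestamp("https://www.sports-reference.com/olympics/countries/ALG/summer/2012/"))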