Flagging non-dead link as dead

This edit flagged this URL as dead even though it isn't. Jo-Jo Eumerus (talk) 11:17, 18 July 2022 (UTC)

Same with these edits:
I appreciate this probably has to do with the automatic PDF link serving in JavaScript that Academia.edu uses, which wouldn't be readily captured by a bot. I don't know how fixable it is, but the links noted are not dead at all; I reverted both edits in which the bot flagged them. Ifly6 (talk) 14:35, 18 July 2022 (UTC)
The url that Editor Jo-Jo Eumerus linked:
Both of the urls that Editor Ifly6 links:
There was some discussion about these kinds of academia links at Wikipedia:Link rot/URL change requests § www.academia.edu/download/
Trappist the monk (talk) 14:43, 18 July 2022 (UTC) 14:46, 18 July 2022 (UTC)
  • Jo-Jo Eumerus & User:Ifly6: they are dead for me (USA). Example. Are you getting a redirect to a cloudfront URL? I wonder if there is some kind of location-aware policy that determines when to serve the cloudfront URL vs a 404. If the cloudfront URL were known, it would be possible to save it at the Wayback Machine, then use the Cloudfront-Wayback URL on Wikipedia, with the citation treated as a dead link (due to its &Expires self-destruct mechanism; see WP:AWSURL). However, I wonder about copyright if academia.edu is making them unavailable in the US and possibly elsewhere: why have that policy if not for a rights issue? -- GreenC 15:04, 18 July 2022 (UTC)
    I'm in the US and am getting the links promptly. The links I am getting are Cloudfront ones with an expiry; I used the Academia.edu links to avoid the known expiry. Ifly6 (talk) 15:41, 18 July 2022 (UTC)
    Ah, I see you use British English so I assumed you are not in the US. What browser do you use? Do you have any plugins that might affect JavaScript? This is impacting archive providers as well, such as Wayback Machine and Ghostarchive (US-based); they also get a 404. At Archive.today it "works" (global IP pool) but they are unable to save the PDF correctly. -- GreenC 16:00, 18 July 2022 (UTC)
    I do get a "d1wqtxts1xzle7.cloudfront.net" sort of thing. Jo-Jo Eumerus (talk) 17:33, 18 July 2022 (UTC)
    Language heuristics are always right 99pc of the time haha. I've confirmed on Edge (Windows 10) and Safari (macOS) that the Academia.edu links work. I don't have any plugins installed other than ad blockers that would affect something like this. The specific link that got generated for me with Rafferty was https://d1wqtxts1xzle7.cloudfront.net/51344857/Iris-_Fall_of_the_Roman_Republic-with-cover-page-v2.pdf. There were then a pile of GET parameters that I've cut (they change every time anyway) but which are necessary to get the file served properly. Ifly6 (talk) 19:24, 18 July 2022 (UTC)
    Jo-Jo Eumerus do you use Edge or Safari? -- GreenC 19:38, 18 July 2022 (UTC)
    Wikipedia:Village_pump_(technical)#academia.edu/download .. seeing if anything comes up here. -- GreenC 19:52, 18 July 2022 (UTC)
    Ifly6 in the above thread someone suggested perhaps you had signed up for an account on academia.edu at some point? Or there are some old cookies that are giving permission. One way to test is to try to access from a private window. -- GreenC 20:46, 18 July 2022 (UTC)
    Yea, that's probably it. I opened it in a private window and got the 404. Ifly6 (talk) 20:57, 18 July 2022 (UTC)
    Same for me (Firefox) Jo-Jo Eumerus (talk) 21:12, 18 July 2022 (UTC)
    Cool, glad we figured out what is causing it. My thinking is to replace the academia.edu links with a Wayback version of the cloudfront URL so it's accessible for everyone. A second option is to use |url-access=registration, but that 404 page is confusing and will result in bots marking it dead. -- GreenC 21:30, 18 July 2022 (UTC)

User:Jo-Jo Eumerus, User:Ifly6, User:Biogeographist: I would like to propose this solution: Special:Diff/1098978075/1099315632. It's only for academia.edu/download links, of which there are about 1,000 on enwiki.

  • academia.edu returns a 404 when a user is not registered and logged in, which is most users. It does not say "log in to access paper" but shows a misleading 404 dead-link page. This causes problems:
    • Archive bots will determine the links are dead (404) and mark them with a {{dead link}}.
    • Users will be confused, thinking the link is dead rather than behind a registration wall.
    • Should the link ever actually die for real, there would be no archive available, since the Wayback Machine sees only a dead 404 page - the Wayback Machine is not an academia.edu registered user.
  • While it is possible to use |url-access=registration, this does not solve the misleading 404 problems.
  • The cloudfront link is an AWS container with an &Expires self-destruct mechanism. It's where the paper is actually located (not on academia.edu, which redirects to cloudfront).
  • The proposal is to determine the active cloudfront link via bot magic, immediately create a Wayback Machine save of the cloudfront URL, and change the citation to the Wayback-cloudfront link, e.g. Special:Diff/1098978075/1099315632
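
The URL handling in the proposal above can be sketched roughly as follows. This is only an illustration in Python, not the bot's actual code: the function names are made up, the example cloudfront path and query values are placeholders, and it assumes the Expires parameter carries a Unix timestamp, as signed cloudfront URLs normally do. It performs no network requests; it only recognises /download/ links, reads the expiry, and formats a Wayback Machine address for a given snapshot time.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def is_download_link(url):
    """True for academia.edu/download/... links, the only ones the proposal covers."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    return host in ("academia.edu", "www.academia.edu") and parts.path.startswith("/download/")


def cloudfront_expiry(url):
    """Return when a signed cloudfront URL self-destructs, or None if it has no Expires."""
    values = parse_qs(urlsplit(url).query).get("Expires")
    if not values:
        return None
    # Assumption: Expires is seconds since the Unix epoch.
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def wayback_url(url, snapshot):
    """Format the Wayback Machine address for a snapshot of url taken at snapshot."""
    return "https://web.archive.org/web/" + snapshot.strftime("%Y%m%d%H%M%S") + "/" + url


# Placeholder URL: the path and query values are invented for the example.
signed = ("https://d1wqtxts1xzle7.cloudfront.net/51344857/example.pdf"
          "?Expires=1658300000&Signature=PLACEHOLDER&Key-Pair-Id=PLACEHOLDER")
expires = cloudfront_expiry(signed)
saved_at = datetime(2022, 7, 20, 4, 15, 0, tzinfo=timezone.utc)
print(expires.isoformat())
print(wayback_url(signed, saved_at))
```

The Wayback save has to happen before the expiry time, which is why the proposal says to create it immediately.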

This is what I can do somewhat easily right away. There are limits to what can be done, due to bot design and coding effort. -- GreenC 04:15, 20 July 2022 (UTC)

Hmm. It seems a bit complex and I wonder if people will be deleting the "expires" part of the link. Jo-Jo Eumerus (talk) 10:22, 20 July 2022 (UTC)
It's a complex situation. If they delete the &Expires the URL will break (404). It will break anyway, due to the Expires; that is why the archive URL version is made the primary. The archive URL is accessible to everyone - academia.edu account not required. -- GreenC 15:30, 20 July 2022 (UTC)

Unfortunately there is something preventing cloudfront pages from being saved at Wayback. Not all pages, but most. So we have a bad situation with academia.edu/download links - ideally they should be converted to non-/download/ links, but that can't be done by bot; it requires manual searching. The /download/ links probably originate from copy-pasting out of Google Scholar. -- GreenC 15:56, 23 July 2022 (UTC)

Backlinks report

User:Certes/Backlinks/Report seems to have stopped, but User:GoingBatty/Backlinks/Report is running normally. I've not added any new backlinks recently. Can you see anything else that I may have broken? Certes (talk) 11:17, 25 July 2022 (UTC)

It aborted for unknown reasons. I increased the memory allocation by 10x in case that is the problem. The data may be messed up from the abort. I've restarted the process and will see over the next hour or so whether it can recover. Worst case, I will just delete all the data and it will rebuild from scratch, but that will result in a missed day. -- GreenC 15:34, 25 July 2022 (UTC)
Thanks. Let me know if I'm checking too many targets or if some produce exceptionally big reports, and I'll remove the less productive ones. Certes (talk) 15:45, 25 July 2022 (UTC)
It was crashing at "m", then after increasing memory it made it to "v". Odd, because it should not run out of memory, and there are no error messages (system or program) to suggest why it's silently halting, so it might be something different. I added debug statements; it takes a while to replicate, an hour or more. Thanks for holding. -- GreenC 04:26, 26 July 2022 (UTC)
Odd: "m" and "v" are early in my list, and neither they nor anything earlier have many incoming links. If it's taking an hour then we may need to remove the entries with lowest benefit per second. A few entries have never triggered a fix and could probably be removed, but I've already removed the resource-heavy ones. Maybe I need to rate them all by fixes done per 1000 incoming links or similar and chop those scoring lowest. "v" is an oddity because it can indicate that the editor failed to press Ctrl when pasting: easy to spot, but hard to fix as you need to guess what was in their clipboard. Certes (talk) 12:39, 26 July 2022 (UTC)
The memory problem appears to be cumulative: if I run m or v in isolation they do fine, but when running the whole bunch there is a massive spike in memory claim that occurs at the same spot, around v or x; others also don't release their claims, so it builds up. It could be related to the Sun Grid Engine caching for performance reasons. I've checked the program for errant global vars and it's fine; there is nothing holding onto data. I might try separating the backlinks retrieval portion into a different program so it exits between each item, clearing any memory claims. -- GreenC 16:48, 26 July 2022 (UTC)
I think it is fixed. It was a combination of repetitive backlinks reported by the API and inefficiencies in the program magnifying those repetitions. It should never use more than about 25MB of RAM, but with "V" (and "v") it was as high as 1 gigabyte. Why V? I suspect it's due to WP:V, which is so commonly linked outside mainspace. V exposed the problem, but it was occurring at a smaller scale with everything else. (The API typically and erroneously reports 100s of the same backlink - I don't know why it's always done this.) "V" had 2.5 million non-unique occurrences. Add to this that the program was inefficient in how it dealt with the repetitions; it added up, and the Grid Engine said nope and dropped the job. Right now it's starting over, rebuilding the database; it should be back to normal soon. -- GreenC 05:44, 27 July 2022 (UTC)
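
One way to keep repeated backlinks from piling up in memory is to drop them as they stream in. The following is a minimal Python sketch of that idea only; the real program is not shown in this thread, and the function name and sample titles are invented for the example.

```python
def unique_backlinks(titles):
    """Yield each backlink title once, in first-seen order.

    Memory use grows with the number of distinct titles, not with the
    number of times the API repeats them.
    """
    seen = set()
    for title in titles:
        if title not in seen:
            seen.add(title)
            yield title


# Simulate an API that reports the same few backlinks hundreds of times.
raw = (title for _ in range(300) for title in ("User:A/sandbox", "Talk:B", "C"))
print(list(unique_backlinks(raw)))
```

Because both the input and the output are iterators, the repeated rows are never all held in a list at once.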
Thanks very much. The current version looks right, considering that it's for a few hours rather than the usual 24. Is it possible to add the namespace of the link target to the query? I'm not sure how you're extracting the data but, for example, Quarry would run its SQL much faster with "and pl_namespace=0". Certes (talk) 11:21, 27 July 2022 (UTC)
API:Backlinks. When I first made this program (not your fork of it) around April 2015, Quarry was only about 6 months old I think; anyway, I wasn't aware of it, and I wanted something that would run from anywhere, which left the API. Speed is not an issue when running daily, unless it takes > 24hrs. Your job completes in about 2 hours; it is exceptionally big. The API behavior of multiple results is weird but can be adjusted for. If it continues to be a problem I can look into Quarry; getting a JSON file would be nice. -- GreenC 15:41, 27 July 2022 (UTC)
In that case, blnamespace is what I meant, but I'm not clear what it should be set to: the several namespaces in which relevant links appear, or ns 0 to which relevant links lead. If my job is taking two hours then I should be checking fewer targets; any clues as to which entries take the most time would help with that. Certes (talk) 18:27, 27 July 2022 (UTC)
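
For reference, a backlinks request with a namespace filter might be built as in the Python sketch below. The parameter names (list=backlinks, bltitle, blnamespace, bllimit, blcontinue) come from API:Backlinks; the helper name and defaults are made up for the example. As far as that documentation describes it, blnamespace restricts the namespaces of the pages that contain the link, not the namespace of the target, but that reading should be checked against the API page before relying on it.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def backlinks_query(title, namespaces=(0,), limit=500, cont=None):
    """Build a list=backlinks request URL for title.

    namespaces are the namespaces of the linking pages to enumerate;
    several can be given and are joined with "|".
    """
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": "|".join(str(n) for n in namespaces),
        "bllimit": str(limit),
        "format": "json",
    }
    if cont:
        # Continuation token returned by the previous response, if any.
        params["blcontinue"] = cont
    return API + "?" + urlencode(params)


print(backlinks_query("V"))
print(backlinks_query("V", namespaces=(0, 10)))
```

Asking only for mainspace linking pages would keep links from project and talk pages (such as those to WP:V) out of the result before the program has to filter them.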
Below is an 'ls' of the data files. The timestamps show how long each took to complete. The file size is misleading as the program filters out namespaces. For example, "V" (and "v"; they are identical to the API) is not a very large file, but took almost 25 minutes to complete. The whole run took about 85m to finish, not 120m; my mistake. V/v is about 50 minutes. U/u 20 minutes. N/n 10 minutes. Those are the big three and use 95% of the time (is that right?). Probably due to WP:V, WP:U and WP:N. -- GreenC 19:28, 27 July 2022 (UTC)
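
The per-file times can be read off the listing mechanically by subtracting each timestamp from the next. The Python sketch below does this for a few rows copied from the listing; it assumes the files were written one after another, that each timestamp marks when that file was finished, and that the year is 2022. Timestamps only have minute resolution, so the results are approximate.

```python
from datetime import datetime

# A few consecutive rows copied from the listing (size, month, day, time, name).
LISTING = """\
13146 Jul 27 09:19 T.new
15718 Jul 27 09:21 U.new
96856 Jul 27 09:45 V.new
12403 Jul 27 09:45 W.new
"""


def minutes_per_file(listing, year=2022):
    """Return (name, minutes) pairs: time elapsed since the previous file finished."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        month, day, clock = fields[1:4]
        name = " ".join(fields[4:])
        stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M")
        rows.append((name, stamp))
    # The first row has no predecessor, so it gets no duration.
    return [
        (name, int((stamp - previous).total_seconds() // 60))
        for (_, previous), (name, stamp) in zip(rows, rows[1:])
    ]


for name, minutes in minutes_per_file(LISTING):
    print(name, minutes)
```

On this excerpt V.new accounts for 24 minutes, which matches the "almost 25 minutes" mentioned above.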
Thanks. I'll take V/v, U/u and N/n out then. U and N rarely get a hit. V gets more but I'm less confident about fixing them as most of them require me to guess what article the editor was thinking of. Certes (talk) 20:57, 27 July 2022 (UTC)
All working as normal today, and an hour faster than previously. Thanks again for your help. Certes (talk) 10:03, 28 July 2022 (UTC)
Yes, finished in 25 minutes. No single one took very long (or much memory!). You are welcome and thanks for reporting it because it uncovered a problem in the program that only became evident at scale. -- GreenC 15:52, 28 July 2022 (UTC)
Extended content
22930	Jul	27	09:11	0.new
127027	Jul	27	09:11	1.new
16924	Jul	27	09:11	2.new
15575	Jul	27	09:11	3.new
15540	Jul	27	09:11	4.new
14709	Jul	27	09:12	5.new
12741	Jul	27	09:12	6.new
17054	Jul	27	09:12	7.new
15220	Jul	27	09:12	8.new
14745	Jul	27	09:12	9.new
7476	Jul	27	09:13	10.new
6315	Jul	27	09:13	100.new
15741	Jul	27	09:13	A.new
13776	Jul	27	09:13	B.new
16104	Jul	27	09:13	C.new
13410	Jul	27	09:13	D.new
13301	Jul	27	09:14	E.new
12605	Jul	27	09:14	F.new
13550	Jul	27	09:14	G.new
13518	Jul	27	09:14	H.new
14387	Jul	27	09:14	I.new
13005	Jul	27	09:14	J.new
12845	Jul	27	09:14	K.new
14099	Jul	27	09:14	L.new
13174	Jul	27	09:14	M.new
39805	Jul	27	09:18	N.new
13668	Jul	27	09:19	O.new
13088	Jul	27	09:19	P.new
11858	Jul	27	09:19	Q.new
14160	Jul	27	09:19	R.new
14529	Jul	27	09:19	S.new
13146	Jul	27	09:19	T.new
15718	Jul	27	09:21	U.new
96856	Jul	27	09:45	V.new
12403	Jul	27	09:45	W.new
12797	Jul	27	09:45	X.new
13659	Jul	27	09:45	Y.new
13403	Jul	27	09:45	Z.new
15741	Jul	27	09:45	a.new
13776	Jul	27	09:45	b.new
16104	Jul	27	09:45	c.new
13410	Jul	27	09:46	d.new
13301	Jul	27	09:46	e.new
12605	Jul	27	09:46	f.new
13550	Jul	27	09:46	g.new
13518	Jul	27	09:46	h.new
14387	Jul	27	09:46	i.new
13005	Jul	27	09:46	j.new
12845	Jul	27	09:46	k.new
14099	Jul	27	09:46	l.new
13174	Jul	27	09:46	m.new
39805	Jul	27	09:51	n.new
13668	Jul	27	09:51	o.new
13088	Jul	27	09:51	p.new
11858	Jul	27	09:51	q.new
14160	Jul	27	09:51	r.new
14529	Jul	27	09:51	s.new
13146	Jul	27	09:51	t.new
15718	Jul	27	09:53	u.new
96856	Jul	27	10:16	v.new
12403	Jul	27	10:16	w.new
12797	Jul	27	10:16	x.new
13659	Jul	27	10:16	y.new
13403	Jul	27	10:16	z.new
217699	Jul	27	10:17	ABC
5951	Jul	27	10:17	Accolade.new
118095	Jul	27	10:17	Acre.new
89027	Jul	27	10:17	Admiral.new
22088	Jul	27	10:17	Alphabet.new
29758	Jul	27	10:17	Amber.new
4295	Jul	27	10:17	Amen.new
31785	Jul	27	10:17	Aperture.new
2643	Jul	27	10:17	Ash.new
2643	Jul	27	10:17	ash.new
44238	Jul	27	10:17	Atlantic.new
1375	Jul	27	10:17	Back.new
1375	Jul	27	10:17	back.new
36337	Jul	27	10:17	Bay.new
36337	Jul	27	10:17	bay.new
53374	Jul	27	10:17	Bowling.new
53374	Jul	27	10:17	bowling.new
2048	Jul	27	10:17	Cabinet
36569	Jul	27	10:17	Captain.new
36569	Jul	27	10:17	captain.new
12368	Jul	27	10:17	Calvary.new
12368	Jul	27	10:17	calvary.new
26920	Jul	27	10:17	Caterpillar.new
28665	Jul	27	10:17	Chancellor.new
28665	Jul	27	10:17	chancellor.new
31754	Jul	27	10:17	Chestnut.new
31754	Jul	27	10:17	chestnut.new
4924	Jul	27	10:17	Chin.new
725	Jul	27	10:17	Clipboard.new
725	Jul	27	10:17	clipboard.new
44162	Jul	27	10:17	Colony.new
44162	Jul	27	10:18	colony.new
3070	Jul	27	10:18	Colonies.new
3070	Jul	27	10:18	colonies.new
55	Jul	27	10:18	Colors.new
55	Jul	27	10:18	colors.new
565	Jul	27	10:18	Colours.new
565	Jul	27	10:18	colours.new
138372	Jul	27	10:19	Company.new
138372	Jul	27	10:20	company.new
6611	Jul	27	10:20	Companies.new
6611	Jul	27	10:20	companies.new
14699	Jul	27	10:20	Consul.new
14699	Jul	27	10:20	consul.new
76725	Jul	27	10:20	Colorado
3180	Jul	27	10:21	Commonwealth.new
3180	Jul	27	10:21	commonwealth.new
30657	Jul	27	10:21	Conservative.new
1206	Jul	27	10:21	Conservatives.new
113900	Jul	27	10:21	Corvette.new
2005	Jul	27	10:21	Corvettes.new
28639	Jul	27	10:21	Delphi.new
48181	Jul	27	10:21	Family.new
48181	Jul	27	10:21	family.new
2257	Jul	27	10:21	Families.new
2257	Jul	27	10:21	families.new
61603	Jul	27	10:21	Icon.new
61603	Jul	27	10:21	icon.new
6665	Jul	27	10:21	Icons.new
6665	Jul	27	10:21	icons.new
5801	Jul	27	10:21	Interpreter.new
5801	Jul	27	10:21	interpreter.new
70977	Jul	27	10:21	Jupiter.new
12095	Jul	27	10:21	Knot.new
12095	Jul	27	10:21	knot.new
80891	Jul	27	10:21	Krishna.new
121459	Jul	27	10:21	Lead.new
121459	Jul	27	10:21	lead.new
127	Jul	27	10:21	Liberal
180	Jul	27	10:21	Libertarian
183969	Jul	27	10:22	Madonna.new
183969	Jul	27	10:22	madonna.new
65528	Jul	27	10:22	Mass.new
65528	Jul	27	10:22	mass.new
5378	Jul	27	10:22	Meta.new
770	Jul	27	10:22	Ministry
3160	Jul	27	10:22	Model.new
3160	Jul	27	10:22	model.new
176677	Jul	27	10:23	Moon.new
176677	Jul	27	10:23	moon.new
214735	Jul	27	10:23	National
199067	Jul	27	10:23	Oxygen.new
76332	Jul	27	10:23	Primate.new
76332	Jul	27	10:23	primate.new
5462	Jul	27	10:23	Roland.new
346	Jul	27	10:24	Ronaldo.new
68973	Jul	27	10:24	Salt.new
68973	Jul	27	10:24	salt.new
16813	Jul	27	10:24	Season.new
16813	Jul	27	10:24	season.new
44306	Jul	27	10:24	Shiraz.new
44306	Jul	27	10:24	shiraz.new
53287	Jul	27	10:24	Spire.new
53287	Jul	27	10:24	spire.new
153867	Jul	27	10:24	Stream.new
153867	Jul	27	10:24	stream.new
11482	Jul	27	10:24	Telegram.new
3845	Jul	27	10:24	Thermal.new
3845	Jul	27	10:24	thermal.new
88519	Jul	27	10:24	Tree.new
88519	Jul	27	10:24	tree.new
3102	Jul	27	10:24	Trojan
3102	Jul	27	10:24	trojan
167	Jul	27	10:24	U.S.
2334	Jul	27	10:24	Victory.new
26424	Jul	27	10:24	Ardennes.new
19159	Jul	27	10:24	Aspen.new
1884	Jul	27	10:24	Baler.new
105737	Jul	27	10:25	Batman.new
20662	Jul	27	10:25	Battle.new
53364	Jul	27	10:25	Bethlehem.new
439921	Jul	27	10:25	Birmingham.new
11530	Jul	27	10:25	Boulder.new
54094	Jul	27	10:25	Brampton.new
14995	Jul	27	10:25	Calvados.new
208354	Jul	27	10:25	Cambridge.new
71179	Jul	27	10:25	Canterbury.new
15715	Jul	27	10:25	Caracal.new
203571	Jul	27	10:26	Christchurch.new
78460	Jul	27	10:26	Cicero.new
43543	Jul	27	10:26	Durango.new
18943	Jul	27	10:26	East
296629	Jul	27	10:26	Edmonton.new
12304	Jul	27	10:26	Esplanade.new
25247	Jul	27	10:26	Eye.new
32977	Jul	27	10:26	Flint.new
151	Jul	27	10:26	Gladstone.new
81116	Jul	27	10:26	Gloucester.new
56266	Jul	27	10:26	Greenwich.new
780	Jul	27	10:26	Guna.new
21889	Jul	27	10:26	Horsham.new
199436	Jul	27	10:26	Hyderabad.new
89915	Jul	27	10:26	Ipswich.new
15229	Jul	27	10:26	Ithaca.new
132579	Jul	27	10:27	Lagos.new
68478	Jul	27	10:27	La
18993	Jul	27	10:27	Leek.new
439197	Jul	27	10:27	Liverpool.new
26324	Jul	27	10:27	Loire.new
54	Jul	27	10:27	Loni.new
8106	Jul	27	10:27	Malmesbury.new
35538	Jul	27	10:27	Mansfield.new
7545	Jul	27	10:27	March.new
16434	Jul	27	10:27	Mold.new
25849	Jul	27	10:27	Moselle.new
33698	Jul	27	10:27	New
270789	Jul	27	10:27	New
205009	Jul	27	10:28	Norfolk.new
112023	Jul	27	10:28	Norwich.new
28431	Jul	27	10:28	Ore.new
71930	Jul	27	10:28	Pali.new
83138	Jul	27	10:28	Panama
373705	Jul	27	10:28	Perth.new
99124	Jul	27	10:28	Piedmont.new
22133	Jul	27	10:28	Pueblo.new
73659	Jul	27	10:28	Punjab.new
30869	Jul	27	10:28	Reading.new
100419	Jul	27	10:29	Republic
19646	Jul	27	10:29	Rye.new
23084	Jul	27	10:29	Saga.new
6106	Jul	27	10:29	Saint
5866	Jul	27	10:29	St.
11630	Jul	27	10:29	Saint
5336	Jul	27	10:29	St.
97107	Jul	27	10:29	St.
22068	Jul	27	10:29	Stanford.new
255991	Jul	27	10:29	Surrey.new
93952	Jul	27	10:29	Tripoli.new
50366	Jul	27	10:29	Troy.new
38853	Jul	27	10:29	Van.new
18130	Jul	27	10:29	Vosges.new
21909	Jul	27	10:29	Warwick.new
15455	Jul	27	10:29	Angels.new
23662	Jul	27	10:29	Arsenal.new
38084	Jul	27	10:29	Avalanche.new
2391	Jul	27	10:29	Barbarians.new
1558	Jul	27	10:29	Bears.new
5145	Jul	27	10:29	Border
296	Jul	27	10:29	Broncos.new
463	Jul	27	10:29	Buccaneers.new
1063	Jul	27	10:29	Canadiens.new
15399	Jul	27	10:29	Cavaliers.new
751	Jul	27	10:29	Cheetahs.new
367	Jul	27	10:29	Corinthians.new
3529	Jul	27	10:29	Coyotes.new
9722	Jul	27	10:29	Crusaders.new
5268	Jul	27	10:29	Dolphins.new
3090	Jul	27	10:29	Dragons.new
4159	Jul	27	10:29	Ducks.new
160	Jul	27	10:29	Eagles.new
45	Jul	27	10:29	Flames.new
48481	Jul	27	10:29	Force.new
181	Jul	27	10:29	Griquas.new
2627	Jul	27	10:29	Hawks.new
27971	Jul	27	10:29	Heat.new
653	Jul	27	10:29	Hornets.new
5809	Jul	27	10:29	Hurricanes.new
949	Jul	27	10:29	Jaguars.new
223	Jul	27	10:29	Jays.new
1571	Jul	27	10:29	Leopards.new
43470	Jul	27	10:30	Lightning.new
2409	Jul	27	10:30	Lions.new
229	Jul	27	10:30	Ospreys.new
1981	Jul	27	10:30	Pelicans.new
2413	Jul	27	10:30	Penguins.new
9026	Jul	27	10:30	Pirates.new
4012	Jul	27	10:30	Predators.new
2731	Jul	27	10:30	Rockets.new
802	Jul	27	10:30	Rockies.new
7330	Jul	27	10:30	Saints.new
9918	Jul	27	10:30	Saracens.new
3954	Jul	27	10:30	Sharks.new
3306	Jul	27	10:30	Stars.new
6305	Jul	27	10:30	Thunder.new
2129	Jul	27	10:30	Tigers.new
26592	Jul	27	10:30	Titans.new
3808	Jul	27	10:30	Twins.new
98682	Jul	27	10:30	Vikings.new
663	Jul	27	10:30	Warriors.new
3396	Jul	27	10:30	Wasps.new
5597	Jul	27	10:30	Wolves.new
6	Jul	27	10:30	Zunz.new
795	Jul	27	10:30	Orsini.new
226	Jul	27	10:30	Rockefeller.new
32	Jul	27	10:30	Paintal.new
483	Jul	27	10:30	Rothschild.new
8	Jul	27	10:30	Pevsner.new
4861	Jul	27	10:30	O'Reilly.new
62	Jul	27	10:30	Primo
18	Jul	27	10:30	Cimarosa.new
53	Jul	27	10:30	Narasimha
505	Jul	27	10:30	Caracciolo.new
155	Jul	27	10:30	Bakunin.new
665	Jul	27	10:30	Weber.new
26	Jul	27	10:30	Malevich.new
57	Jul	27	10:30	Korotayev.new
18	Jul	27	10:30	Krauser.new
186	Jul	27	10:30	Ghazali.new
266	Jul	27	10:30	Touré.new
190	Jul	27	10:30	Sadat.new
288	Jul	27	10:30	Rajguru.new
289	Jul	27	10:30	Maitland.new
83	Jul	27	10:30	Strozzi.new
90	Jul	27	10:30	Delacroix.new
167	Jul	27	10:30	Reuter.new
185	Jul	27	10:30	Baden
31	Jul	27	10:30	Lessing.new
129	Jul	27	10:30	Boyle.new
96	Jul	27	10:30	Aelian.new
48	Jul	27	10:30	Zichy.new
64	Jul	27	10:30	Nomura.new
204	Jul	27	10:30	Takeda.new
21	Jul	27	10:30	Gilbert
265	Jul	27	10:30	Batista.new
939	Jul	27	10:30	Andrássy.new
544	Jul	27	10:30	Prabhu.new
165	Jul	27	10:30	Tyszkiewicz.new
22	Jul	27	10:30	Mommsen.new
251	Jul	27	10:30	Köppen.new
492	Jul	27	10:30	Della
168	Jul	27	10:30	Bernstein.new
32	Jul	27	10:30	Tippett.new
380	Jul	27	10:30	Sanseverino.new
51	Jul	27	10:30	Pucci.new
377	Jul	27	10:30	Hieronymus
113	Jul	27	10:30	Ghirlandaio.new
65	Jul	27	10:30	Beckett.new
711	Jul	27	10:30	O'Ryan.new
273	Jul	27	10:30	Neumann.new
10	Jul	27	10:30	Matsushita.new
1276	Jul	27	10:30	Ferrero.new
114	Jul	27	10:30	Dietz.new
59	Jul	27	10:30	Amorim.new
29	Jul	27	10:30	Wankel.new
594	Jul	27	10:30	Uexküll.new
20	Jul	27	10:30	Stirner.new
80	Jul	27	10:30	Sridhar.new
234	Jul	27	10:30	Rossetti.new
150	Jul	27	10:30	Nassar.new
115	Jul	27	10:30	Morandi.new
160	Jul	27	10:30	Bulgakov.new
25	Jul	27	10:30	Barks.new
136	Jul	27	10:30	Agnelli.new
350	Jul	27	10:30	Teleki.new
134	Jul	27	10:30	Tarnowski.new
574	Jul	27	10:30	Hamdan.new
93	Jul	27	10:30	Guicciardini.new
589	Jul	27	10:30	Clark.new
97	Jul	27	10:30	Borromeo.new
22	Jul	27	10:30	Bazzi.new
51	Jul	27	10:30	Wolf-Ferrari.new
357	Jul	27	10:30	Sylvester.new
26	Jul	27	10:30	Schichau.new
164	Jul	27	10:30	Scarlatti.new
67	Jul	27	10:30	Noriega.new
24	Jul	27	10:30	Bohlen.new
40	Jul	27	10:30	Boiardo.new
45	Jul	27	10:30	Bosman.new
446	Jul	27	10:30	Braun.new
9	Jul	27	10:30	Gabrielli.new
56	Jul	27	10:30	Haider.new
49	Jul	27	10:30	Jayachandran.new
72	Jul	27	10:30	Jellinek.new
332	Jul	27	10:30	Manning.new
28	Jul	27	10:30	Naryshkin.new
157	Jul	27	10:30	Sachs.new
118	Jul	27	10:30	Sacks.new
101	Jul	27	10:30	Saunders.new
159	Jul	27	10:30	Uccello.new
204	Jul	27	10:30	Velazquez.new
29	Jul	27	10:30	Wills.new
60	Jul	27	10:30	Bergman.new
759	Jul	27	10:30	Haim.new
18588	Jul	27	10:30	Agamemnon.new
3872	Jul	27	10:30	Antigone.new
33458	Jul	27	10:30	Bloomsbury.new
36678	Jul	27	10:30	Cabaret.new
494	Jul	27	10:30	Can-Can.new
23895	Jul	27	10:30	Carousel.new
7172	Jul	27	10:30	Cyrano
47072	Jul	27	10:30	Dune.new
13573	Jul	27	10:30	Euphoria.new
6460	Jul	27	10:30	Falstaff.new
13338	Jul	27	10:30	Faust.new
575	Jul	27	10:30	Fra
1650	Jul	27	10:30	Gidget.new
16873	Jul	27	10:31	Gladiator.new
85498	Jul	27	10:31	Julius
10409	Jul	27	10:31	Medea.new
7415	Jul	27	10:31	Mystic
536	Jul	27	10:31	Peaky
9674	Jul	27	10:31	Peer
16265	Jul	27	10:31	Pericles.new
60538	Jul	27	10:31	Quartz.new
9418	Jul	27	10:31	Salome.new
49778	Jul	27	10:31	St.
84	Jul	27	10:31	The
9885	Jul	27	10:31	Ansible.new
20259	Jul	27	10:31	Arrow.new
57727	Jul	27	10:31	Daily
672758	Jul	27	10:31	The
8853	Jul	27	10:32	Decanter.new
11944	Jul	27	10:32	Dissent.new
13559	Jul	27	10:32	Germania.new
7858	Jul	27	10:32	Guernica.new
29403	Jul	27	10:32	Life.new
6739	Jul	27	10:32	The
809	Jul	27	10:32	The
195831	Jul	27	10:32	The
13864	Jul	27	10:32	Referee.new
2987	Jul	27	10:32	Sunday
24360	Jul	27	10:32	Sunday
154416	Jul	27	10:32	The
5692	Jul	27	10:32	Cage.new
872	Jul	27	10:32	Carpenters.new
2853	Jul	27	10:32	Chrysalis.new
133	Jul	27	10:32	Doors.new
324	Jul	27	10:32	Fernando.new
62059	Jul	27	10:32	Grenade.new
38621	Jul	27	10:32	Guru.new
125	Jul	27	10:32	Happy.new
970	Jul	27	10:32	Hello.new
190	Jul	27	10:32	Jojo.new
13288	Jul	27	10:32	Pink.new
84108	Jul	27	10:33	Sugar.new
16057	Jul	27	10:33	anchorage.new
25	Jul	27	10:33	barks.new
105737	Jul	27	10:33	batman.new
109392	Jul	27	10:33	derby.new
166471	Jul	27	10:33	jersey.new
107237	Jul	27	10:33	limerick.new
121643	Jul	27	10:33	louvre.new
332	Jul	27	10:33	manning.new
7545	Jul	27	10:33	march.new
99124	Jul	27	10:34	piedmont.new
118	Jul	27	10:34	sacks.new
1443	Jul	27	10:34	sandbanks.new
26151	Jul	27	10:34	slough.new
255991	Jul	27	10:34	surrey.new
50366	Jul	27	10:34	troy.new
29	Jul	27	10:34	wills.new
523	Jul	27	10:34	The.new
523	Jul	27	10:34	the.new
48	Jul	27	10:34	Is.new
48	Jul	27	10:34	is.new
337	Jul	27	10:34	were.new
199	Jul	27	10:34	That.new
199	Jul	27	10:34	that.new
370	Jul	27	10:34	said.new
1155	Jul	27	10:34	One.new
1155	Jul	27	10:34	one.new
5430	Jul	27	10:34	goes.new

Bot updating Webarchive template is adding "url" same as existing "url2"

This bot made a group of WaybackMedic 2.5 edits in June where it "rescued" an archive link in the |url= parameter of {{Webarchive}}, replacing it with a link which was already in the |url2= parameter. Two examples of this are Grant Bramwell: revised 1 June 2022 and List of ICF Canoe Sprint World Championships medalists in men's kayak: revised 26 June 2022. Can the bot remove the duplicate url2/date2/title2 parameters and renumber any subsequent url3/date3/title3, etc.? I've fixed over 500 of these edits myself, but there are still over 700 remaining to be fixed. Thanks. -- Zyxw (talk) 03:54, 9 August 2022 (UTC)

That was part of the deprecation of WebCite, which is a dead archive provider. It didn't account for dups. It's complicated here because even though |url= and |url2= are the same, |title= and |title2= are different: which do you choose? I think the best course is to keep the |url= set and remove the |url2= set, at least based on the two examples. Renumbering is not required, as the webarchive template is designed to allow any numbers up to 10; the only requirement is that there is a |url= (aka |url1=). I'll start looking at this today. -- GreenC 15:35, 9 August 2022 (UTC)
@GreenC: I agree with keeping the |url= set and removing the |url2= set when there is a duplicate URL and that is what I did for the 500+ already fixed. I also thought {{Webarchive}} might automatically handle the missing |url2= set and display the |url3= set, but as per these tests that is not the case:
archive with url/date/title, url2/date2/title2, and url3/date3/title3
url2/date2/title2 removed with url3/date3/title3 remaining
url2/date2/title2 removed and url3/date3/title3 renumbered
-- Zyxw (talk) 16:15, 9 August 2022 (UTC)
Reported at Template_talk:Webarchive#Gaps_in_argument_sequence. I wrote the template originally but Trappist did a major rewrite so I'm not sure if that is my bug or his. I processed the first 500 articles and there are only 3 with a |url3=, suggesting 40 or 50 at most in the whole bunch. Anyway it won't be difficult to renumber them. -- GreenC 16:26, 9 August 2022 (UTC)
Ah, I miscalculated: it's 733, not 7,330 :) It's done; if you see anything more, let me know. -- GreenC 17:08, 9 August 2022 (UTC)
Fixed the webarchive bug. -- GreenC 18:06, 9 August 2022 (UTC)
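
The cleanup agreed in this thread (keep the |url= set, drop a later set with the same URL, close the gap in the numbering) can be sketched as below. This is an illustration in Python working on a plain dict of template parameters, not the bot's code: the function name and sample values are invented, and parsing the wikitext into that dict is assumed to happen elsewhere.

```python
FIELDS = ("url", "date", "title")
MAX_SETS = 10  # the thread says the template allows numbers up to 10


def dedupe_webarchive(params):
    """Drop url/date/title sets whose URL repeats an earlier set, then renumber.

    Set 1 may be written as url or url1. A dateN or titleN without a urlN is
    dropped in this sketch. Parameters outside the numbered sets are kept.
    """
    kept, seen_urls, handled = [], set(), set()
    for n in range(1, MAX_SETS + 1):
        suffixes = ("", "1") if n == 1 else (str(n),)
        group = {}
        for field in FIELDS:
            for suffix in suffixes:
                key = field + suffix
                if key in params:
                    handled.add(key)
                    group.setdefault(field, params[key])
        url = group.get("url")
        if url is None or url in seen_urls:
            continue  # empty slot or duplicate of an earlier set
        seen_urls.add(url)
        kept.append(group)
    result = {}
    for index, group in enumerate(kept, start=1):
        suffix = "" if index == 1 else str(index)
        for field in FIELDS:
            if field in group:
                result[field + suffix] = group[field]
    for key, value in params.items():
        if key not in handled:
            result[key] = value
    return result


example = {
    "url": "https://web.archive.org/web/2020/http://example.com/a",
    "date": "2020-01-01", "title": "First title",
    "url2": "https://web.archive.org/web/2020/http://example.com/a",
    "date2": "2020-01-01", "title2": "Second title",
    "url3": "https://archive.today/abcde",
    "date3": "2021-05-05", "title3": "Third title",
}
print(dedupe_webarchive(example))
```

When two sets share a URL, the earlier set's title wins, which is the choice both editors settled on above.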

Bad webcitation link replacement

So I've just found out that GreenC bot made edits like this, replacing a dead archive link with another dead archive link. Would it be possible to replace that archive link with, say, this one that actually works? Thanks very much! Graham87 11:48, 26 August 2022 (UTC)

Bots are not 100% perfect. It relies on the Wayback API to determine live links, and that is not perfect, so those errors depend on human intervention to correct. The alternative is not to use bots at all, in which case most links never get fixed due to the scale; it's boring back-end work people want bots to do, but there is no guarantee that bots, or for that matter people, will not make mistakes. The question is the scale of mistakes. -- GreenC 15:08, 26 August 2022 (UTC)
Yeah fair enough, soft 404s and all. On re-reading my message I spectacularly failed at phrasing it clearly ... there are nearly a hundred more such links; could you instruct the bot to replace them with a working archive (i.e. the one linked above)? I thought that would be the easiest way to fix this problem. I tried changing the archive link on InternetArchiveBot's side and asking it to fix the affected articles, but that didn't do what I intended. Graham87 13:34, 27 August 2022 (UTC)
OK it's done. Yeah, there's no way to automate replacement of one archive with another via IABot. That would be a good feature though when finding soft-404s. -- GreenC 16:16, 27 August 2022 (UTC)
Opened Phab T316438 .. no idea if or when. -- GreenC 16:34, 27 August 2022 (UTC)

Avoid editing inside HTML comments

GreenC bot now edits inside HTML comments, e.g. Special:Diff/1107954452, but I suggest it should not. Although the edit in this example happened to be harmless (even useful), comments can in general be used for a wide range of reasons, so there is a higher risk that automatic edits could break their intent. Wotheina (talk) 03:49, 2 September 2022 (UTC)

That's true, but there is a positive trade-off, so for a couple of reasons I am OK fixing certain (not all) link rot in comments, as I have been doing for 7 years. If someone wants to preserve a block of immutable wikitext they should use the talk page, a user page or offline storage; otherwise anyone can edit the comment or delete it entirely. Comments can be strangely formatted; I take measures, automatic and manual, to check commented text before posting a live diff. -- GreenC 05:39, 2 September 2022 (UTC)
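
For anyone who wants to treat commented text differently, a tool first has to know which matches fall inside an HTML comment. The Python sketch below shows one simple way to separate them; it is not taken from the bot, the function names and the URL pattern are assumptions for the example, and it ignores complications such as comments inside nowiki or pre tags.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
URL = r"https?://[^\s<>]+"  # deliberately simple pattern for the example


def split_by_comment(wikitext, pattern=URL):
    """Return (outside, inside): matches outside and inside HTML comments."""
    spans = [m.span() for m in COMMENT.finditer(wikitext)]
    outside, inside = [], []
    for match in re.finditer(pattern, wikitext):
        in_comment = any(start <= match.start() and match.end() <= end
                         for start, end in spans)
        (inside if in_comment else outside).append(match.group(0))
    return outside, inside


text = "See http://a.example/x <!-- old: http://b.example/y --> and http://c.example/z"
outside, inside = split_by_comment(text)
print(outside)
print(inside)
```

The "inside" list could then be skipped, or routed to the extra automatic and manual checks mentioned above.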

Stopping backlinks report during wikibreak

Hello, and thanks again for the useful Backlinks reports. I'm currently taking a Wikibreak and have attempted to exclude my list from the bot's tasks thus but it still ran today. It's not a problem for me if the reports continue but, if you'd like to save some resources by stopping it properly, please go ahead. Certes (talk) 11:25, 5 September 2022 (UTC)

Fixed; it was seeing Action=RUN in the "#" comment. First time this code has been tested :) Have a good break. -- GreenC 05:14, 6 September 2022 (UTC)
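
The kind of bug described here, a setting being picked up from a "#" comment, can be shown with a small Python sketch. The config layout, the STOP value and the function name are assumptions for the example; only Action=RUN and the "#" comment marker come from the thread.

```python
def read_setting(text, key):
    """Return the value of key from key=value lines, ignoring '#' comments."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop everything after '#'
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip().lower() == key.lower():
            return value.strip()
    return None


config = """\
# Set Action=RUN to enable the daily report
Action=STOP
"""

# A plain substring search is fooled by the comment line...
print("Action=RUN" in config)
# ...while parsing line by line and stripping comments is not.
print(read_setting(config, "Action"))
```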

Please Update the monthly list of Top 10000 Wikipedia users by Article Count

Please update the monthly list of Top 10000 Wikipedia users by Article Count, which changes on the 1st and 15th of each month. Abbasulu (talk) 07:52, 3 October 2022 (UTC)

It's still running, for some reason very slowly: in 3 days it has only completed 19%. -- GreenC 12:51, 3 October 2022 (UTC)

Exactly what purpose did this edit serve? Edit summary is misleading at best

https://en.wikipedia.org/w/index.php?title=Rodney_Marks&diff=1095741886&oldid=1091111369 108.246.204.20 (talk) 20:17, 3 October 2022 (UTC)

Don't use {{dead link}} if the citation has a working |archive-url=. -- GreenC 20:46, 3 October 2022 (UTC)
It doesn't. "this page is not available". 108.246.204.20 (talk) 04:15, 14 October 2022 (UTC)
Ah, soft-404. Removed. I also updated the IABot database. -- GreenC 04:24, 14 October 2022 (UTC)

A cookie for you!

  Ulises12345678 (talk) 11:00, 9 October 2022 (UTC)
Thank you. For the Cookie. -- GreenC 14:12, 9 October 2022 (UTC)

RSSSF

Why is this bot changing "website=rsssf.com" to "website=RSSSF" where there is already a "publisher=RSSSF" parameter? In many pages you then get a stupid outcome like this, with double RSSSF linking. Snowflake91 (talk) 10:27, 7 February 2023 (UTC)

Yeah it's not ideal, a work in progress. In any case the problem is there should not be both |work= and |publisher=; use one or the other, not both. And a domain name should not be used; using the name of the site is best practice on Wikipedia. There are so many RSSSF citations, and so many problems with them; I've done a lot of work to fix them but there are still things that need more work. -- GreenC 15:22, 7 February 2023 (UTC)
Prefer |website= over |publisher=. {{cite web}} does not include |publisher= in the citation's metadata.
Trappist the monk (talk) 16:18, 7 February 2023 (UTC)Reply
Special:Diff/1038698982/1138241646 -- GreenC 21:44, 8 February 2023 (UTC)Reply

I think all the doubles are cleared, if you see any more or other problems let me know. -- GreenC 21:45, 8 February 2023 (UTC)Reply

WaybackMedic

@GreenC: It seems that WaybackMedic 2.5 is being run by GreenC bot 2. However, I can't find the source code for version 2.5 in the GitHub repo. I need to read the latest code to learn its current behavior. Have you published it yet? -- NmWTfs85lXusaybq (talk) 14:04, 24 March 2023 (UTC)Reply

I can send snippets or functions for anything you are interested in. The entire codebase is not currently available to the public because it contains some proprietary information. It's written in Nim, with some awk utils. -- GreenC 14:44, 24 March 2023 (UTC)Reply
The bot detection of businessweek.com you mentioned in Wikipedia:Village_pump_(technical)/Archive_203#businessweek.com_links may be bypassed by simply assigning the user agent of a web browser in the header of HTTP requests, such as Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36. As far as I know from version 2.1, WaybackMedic may execute external commands (via execCmdEx) to determine page status, and the assignment of a user agent should be easily implemented via some available parameters. By the way, as of version 2.1, I can see the validate_robots function is implemented in medicapi.nim. -- NmWTfs85lXusaybq (talk) 16:55, 24 March 2023 (UTC)Reply
Thank you for the suggestion to use a browser agent. I tried it, they appear to limit based on query rate, and it's pretty sensitive. I was able to trigger it by manually requesting 8 headers rapidly then it stopped working, sending a header with "HTTP/1.1 307 s2s_high_score" and redirect to a javascript challenge ("press and hold button"). Maybe I could slow the bot down enough between queries, it would be difficult, and extremely slow, perhaps a month or longer for 10k articles, and would need to verify every header is not 307 otherwise abort and manually clear the challenge. GreenC 21:36, 24 March 2023 (UTC)Reply
If they limit the query rate based on IP, you can find some web proxies to accelerate this procedure, as your bot may behave like a web crawler. After you collect and validate some free proxies, you can just apply them alternately to your bot, although their stability is not guaranteed. -- NmWTfs85lXusaybq (talk) 03:47, 25 March 2023 (UTC)Reply
I have access to a web proxy that uses home based IPs and it still didn't work. Maybe the solution is to pull every URL into a file and process them outside the bot with a simple script that waits x seconds between each header query, then feed the results to the bot so it knows which URLs are dead. It can run for however long; it wouldn't matter. Trying to do it inside the bot is too error-prone, too complicated, and ties up the bot too long. -- GreenC 04:11, 25 March 2023 (UTC)Reply
It's a good idea to run this job outside the bot. However, I'm not sure what you mean by a web proxy that uses home based IPs. Have you tried high-anonymity proxies? Did you change proxy IP every time you made a new request? NmWTfs85lXusaybq (talk) 04:45, 25 March 2023 (UTC)Reply
The IPs change with every request, and the IPs are sourced to home broadband users globally, so they are not detectable by CIDR block. I don't know how they got blocked, maybe Cloudflare is on this service and recorded all of the IPs. -- GreenC 14:46, 25 March 2023 (UTC)Reply
Then I suppose your proxy strategy is OK. Please make sure your web proxy has high anonymity if all of your configuration works fine. -- NmWTfs85lXusaybq (talk) 15:20, 25 March 2023 (UTC)Reply
I ran this bot-block avoidance script and it took forever. What I discovered is that just about every link should be archived: either 404, soft-404 or better-off-dead. The latter because the links went to content that was behind a paywall or otherwise messed up in some way, so the archived version is better in nearly every case. -- GreenC 14:17, 3 April 2023 (UTC)Reply
I see you mentioned some awk scripts as a workaround at Wikipedia:Link_rot/URL_change_requests#businessweek.com. However, I can't find the meta directory businessweek.00000-10000 you referred to in the Github repo of InternetArchiveBot and WaybackMedic. NmWTfs85lXusaybq (talk) 07:15, 24 April 2023 (UTC)Reply
Oh, that's a note to myself. If you want the awk script, let me know. It's nothing more than going through a list of URLs, pausing between each to avoid rate limiting, getting the headers and recording the results; if it gets a bot-block header, it notifies and aborts the script. It also shuffles the agent string. The site seemed to learn agent strings and block based on those, which could be avoided by retiring an agent and adding a new one. -- GreenC 13:47, 24 April 2023 (UTC)Reply
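A minimal sketch of the kind of out-of-bot header check described in this thread, written in Python rather than awk. It is not the actual script. The pause between requests, the rotating agent string, and aborting on a bot-block response (the 307 status reported above) come from the discussion; the names check_urls, fetch_status and BotBlockDetected and the 30-second default pause are assumptions made for illustration.

```python
import random
import time
from typing import Callable, Dict, Iterable, List


class BotBlockDetected(Exception):
    """Raised when a response looks like a bot-block challenge."""


def check_urls(
    urls: Iterable[str],
    fetch_status: Callable[[str, str], int],
    agents: List[str],
    pause_seconds: float = 30.0,  # assumed value, tune to the site's rate limit
    block_status: int = 307,      # status seen with the "s2s_high_score" challenge
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Request headers for each URL slowly, picking a random user agent each time.

    fetch_status(url, agent) is a hypothetical helper that performs one
    header request and returns the HTTP status code.
    """
    results: Dict[str, int] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # pause between queries to stay under the rate limit
        agent = random.choice(agents)  # vary the agent string per request
        status = fetch_status(url, agent)
        if status == block_status:
            # Stop here so the challenge can be cleared manually.
            raise BotBlockDetected(f"bot block at {url} after {len(results)} URLs")
        results[url] = status
    return results
```

Passing the fetch function and the sleep function in as parameters keeps the loop testable without network access; retiring a blocked agent string would amount to editing the agents list between runs.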

Backlinks report 2023

User:Certes/Backlinks/Report has stopped updating. The bot is running, as User:GoingBatty/Backlinks/Report still updates. I've not changed the job list in User:Certes/Backlinks since 8 May, nor pressed the stopbutton. Do you know how to restart the report please? Certes (talk) 12:17, 4 June 2023 (UTC)Reply

The process from June 2nd crashed for an unknown reason and turned into a zombie, preventing future runs. I can't kill it, so I contacted Toolforge admins for help. -- GreenC 14:17, 4 June 2023 (UTC)Reply
Working again now – thanks! Certes (talk) 21:50, 4 June 2023 (UTC)Reply

Archiving chapter urls

This is a bit of an edge case with GreenC bot's archive repair task, so I wanted to get your opinion. In several articles where I'm citing an archived book that has separate PDFs for each chapter, I use the |archive-url= parameter for the chapter url (since that's the most important one) and have a Wayback url for the book url in the |url= field. It's not ideal, but I'm not sure how else to handle it. My brief search also found this thread where you indicated that |archive-url= was okay to use for the chapter url. However, GreenC bot switches the |archive-url= field to be the archive of the |url= field (example here).

Is there a better way to format these citations? I'm not able to find any. Otherwise, is there any way I can mark the citations to be ignored by the bot? This seems like a relatively rare case; I imagine it's not worth modifying the bot to handle. Thanks, Pi.1415926535 (talk) 22:14, 14 August 2023 (UTC)Reply

Special:Diff/1170358971/1170410520. Another option:
Vanasse Hangen Brustlin, Inc (August 2005). Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis. Massachusetts Bay Transportation Authority. Archived from the original on July 5, 2016. Chapter 4: Identification and Evaluation of Alternatives – Tier 1 at the Wayback Machine (archived 2016-07-05)
I like this better because it doesn't hack the cite book template arguments. The downside is the display is a little messier. Another way with some duplication:
Vanasse Hangen Brustlin, Inc (August 2005). "Chapter 4: Identification and Evaluation of Alternatives – Tier 1". Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis. Massachusetts Bay Transportation Authority. Archived from the original (PDF) on July 5, 2016. From Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis at the Wayback Machine (archived 2016-07-05)
To keep the bot off the citation add {{cbignore}} template after the end of the cite book but inside the ref tags. -- GreenC 02:17, 15 August 2023 (UTC)Reply
Thanks, much appreciated. Pi.1415926535 (talk) 17:15, 15 August 2023 (UTC)Reply
@GreenC: Please take a look at Special:Diff/1171111146, where the bot edited several citations already tagged with {{cbignore}}. Thanks, Pi.1415926535 (talk) 06:35, 21 August 2023 (UTC)Reply
I found two problems. 1) The {{cbignore}} should follow directly after the template it targets: Special:Diff/1171510462/1171514730. I think the cbignore docs cover this. 2) My bot has a known limitation. Within any block of text between newlines (i.e. a paragraph of text), if there is more than one cbignore, the citations the cbignores follow all need to be unique. In this case the two citations are mirror copies. The bot ignored the cbignore for that reason (it has to do with disambiguation: it needs to know which citation to target). So I modified one of the citations, and they are now unique: Special:Diff/1171514730/1171514803 (changed the semicolon to a colon in the publisher field for the first citation) -- a bit quirky, but tested and it works now. I do recommend using the alt suggestions above, though, because while my bot honors cbignore most other bots do not, and it's probable that some other tool will eventually try to "fix" what it detects as an error (archive URL in the url field). -- GreenC 15:45, 21 August 2023 (UTC)Reply

Incorrect dead flags and archive.today

Hello GreenC! Your bot recently made this strange edit to Pokémon. In it, the bot changed "archive.is" and "archive.ph" to "archive.today". I'm not sure what purpose this has. The task is not explained on User:GreenC bot.

Furthermore, the bot flagged these three sources as dead:

But as you can see, the above links are not dead. So something must've gone wrong there. I've remarked these refs as live. Cheers, Manifestation (talk) 11:04, 19 August 2023 (UTC)Reply

Archive.today is what the owner of archive.today wants us to use; it's a redirector that sends traffic to other domains as they are available. Those three got marked dead because there was an archive URL in the |url= field: the bot moved it to the |archive-url= field, and it assumes that if someone put an archive URL in the main |url= field, the original was probably a dead URL. -- GreenC 14:47, 19 August 2023 (UTC)Reply
@GreenC: Aaah! So that's why. I wrote the text, so I take full responsibility for the url= / archive-url= mixup. As for archive.today: I looked at our article, and it cites this tweet from 4 January '19 in which the owner states that the .is domain might stop working soon. However, the domain is still active. In fact, the '@' handle used by the account to this day is still "@archiveis". I've used archive.today many times, including this year. It always gave me either a .is or a .ph link. Cheers, Manifestation (talk) 15:07, 19 August 2023 (UTC)Reply
Yeah, it redirects to one of the 6 domains like .is or .ph, but if one of those domains gets shut down by the registrar, he can easily switch where it redirects to, without having to change every link on Wikipedia. -- GreenC 15:24, 19 August 2023 (UTC)Reply
Hmm ok. Well I guess we should honor his/her request then. For the sake of clarity, maybe the description of Job #2 / WaybackMedic 2.5 on User:GreenC bot could be expanded a little to include a mention of archive.today? archive.today is not part of the Internet Archive, so the term "WaybackMedic" is a bit misleading. - Manifestation (talk) 16:03, 19 August 2023 (UTC)Reply
Alright I updated fix #21 which also now links to Help:Using_archive.today#Archive.today_compared_to_.is,_.li,_.fo,_.ph,_.vn_and_.md. It started out as Wayback-specific then expanded to all archive providers but I kept the original name anyway. -- GreenC 16:41, 19 August 2023 (UTC)Reply
@GreenC Hi! I know that .today is the domain to be used, but every time I try to open a link with .today it returns a "This site cannot be reached" type of error, and the same goes for .ph links. The only working links I get are the ones with .is. Astubudustu (talk) 10:55, 2 April 2024 (UTC)Reply
This is because the DNS resolver you are using is hosted on Cloudflare, and that won't work (well) with archive.today domains; see Archive.today#Cloudflare_DNS_availability. -- GreenC 15:38, 2 April 2024 (UTC)Reply

WaybackMedic 2.5 adding unnecessary URLs

I saw the bot's task run on Guardians of the Galaxy (film) here and it made edits to three references that used {{Cite Metacritic}}, {{Cite Box Office Mojo}}, and {{Cite The Numbers}}, adding in unnecessary URLs and marking the links as dead. The citation templates construct the urls from the given parameters (as most follow a common format on those sites) and were not dead. Didn't know if this was a bot issue, or the templates themselves doing something that is flagging the citations to make the bot adjust them. I can look into the templates to see what the issues may be if that is ultimately the case (and to know what to look for for the error). - Favre1fan93 (talk) 14:16, 24 August 2023 (UTC)Reply

That is a bot error. It is in 9 articles. I rolled them back (you got 2). Thanks for the report. -- GreenC 15:00, 24 August 2023 (UTC)Reply
No problem, thank you! - Favre1fan93 (talk) 15:26, 24 August 2023 (UTC)Reply

Timestamp mismatch

This bot is changing the archive-url as seen here, but it is not changing the archive-date as required, creating a timestamp mismatch error, as seen here. I just recently emptied this category and now it has over 80 articles (when I wrote this) in it again. Your help would be appreciated. Thanks. Isaidnoway (talk) 05:57, 2 September 2023 (UTC)Reply

I am aware. I did it in two steps because, given the way this particular job was programmed, it was easier this way. You saw it in that 30-minute gap between runs. -- GreenC 16:11, 2 September 2023 (UTC)Reply

My bot can empty that category easily. It was 40,000 a week ago. I got it down to a few hundred edge cases, which I assume you fixed manually, thank you. I'd like to fully automate it, but right now it's all integrated into WP:WAYBACKMEDIC, which can't be fully automated, so I run it on request. -- GreenC 16:16, 2 September 2023 (UTC)Reply

User:Isaidnoway, I'm running a bot job to convert archive.today URLs from short-form to long-form. Example. It is exposing old problems with date mismatches that are showing up in Category:CS1 errors: archive-url -- after this bot job completes, I'll run another bot to fix the date mismatches, it will clear the tracking cat. No need to do anything manually. -- GreenC 04:57, 8 September 2023 (UTC)Reply

Hi GreenC! My bot is following yours today. There were several instances where your bot reformatted archive URLs, like this edit, and mine then fixed the archive dates, as in the following edit. My bot is running on Category:CS1 errors: dates, and pulling the archive date from the archive URL. Any chance your bot could do it all in one edit? Thanks! GoingBatty (talk) 18:25, 8 September 2023 (UTC)Reply
I used to be able to fix archive.today problems and date mismatches in the same process, but it was semi-automated. Fixing archive.today problems can and should be full-auto, so I separated that out to its own process that uses EventStream to monitor real-time when a new short-form link shows up, log the article name, and once a month or so fix them - all full-auto. Across 100s of wikis. The downside is this program can't fix date mismatch problems. I want to fix date mismatches automatically, and hope to do that eventually with its own process. Once I have that developed I can see about including it in the archive.today program, so it saves the extra edit, when the source of the date mismatch is archive.today short to long conversion.
The tracking category will be cleared in the next few hours, it's currently generating diffs. This is a one-off event clearing out the backlog of archive.today problems which exposed a lot of problems. Going forward there will be much smaller numbers. We both currently have bots that can clear that category on request, do you know how to update the docs for the category page? -- GreenC 23:41, 8 September 2023 (UTC)Reply
Not sure which category page you're referring to, but most of the text on these category pages comes from Help:CS1 errors, so if you updated the help page, it would also appear on the appropriate category page. GoingBatty (talk) 03:15, 9 September 2023 (UTC)Reply
Category:CS1 errors: archive-url. Do you want me to include your bot in the doc as available to clear the cat on-request? I'm going to mention WaybackMedic is available, but only if there are more than 500 entries. -- GreenC 14:25, 9 September 2023 (UTC)Reply
I don't have a bot to clear Category:CS1 errors: archive-url. GoingBatty (talk) 18:21, 9 September 2023 (UTC)Reply
Oh I see, I misinterpreted what you said above. I thought it was fixing mismatched dates, but it was actually fixing an incomplete date. -- GreenC 19:12, 9 September 2023 (UTC)Reply

Economy of Zimbabwean

I need some help Mindthem (talk) 21:13, 25 September 2023 (UTC)Reply

@Mindthem: How would you like the bot to help with the Economy of Zimbabwe article? GoingBatty (talk) 19:20, 29 September 2023 (UTC) (talk page stalker)Reply

Backlinks

Hi there! I see your bot delivered a new Backlinks report for Certes, but I didn't receive an update today. Could you please give the bot a nudge? Thanks! GoingBatty (talk) 19:21, 29 September 2023 (UTC)Reply

I saw some messages this morning Toolforge was down due to NFS, likely your run didn't complete before the outage. I see it aborted around 09:32GMT and Certes finished at 09:28 .. with minutes to spare. I'll run yours again now. -- GreenC 19:37, 29 September 2023 (UTC)Reply
Report received - thank you! GoingBatty (talk) 02:49, 30 September 2023 (UTC)Reply

Bot put italics in strange places

I don't know what happened here, but the bot appears to have put italics in place where they didn't belong, and then missed putting them in where they did belong. Given that the bot had to edit three times, I imagine this bot run was stressful for you. If this code is still active, it might need yet another debugging. – Jonesey95 (talk) 18:26, 19 October 2023 (UTC)Reply

Yeah this was a pain, every time I thought it was done, some new issue came up. And getting those ticks right, in the right place, after the fact, wasn't easy. Anyway this task is done for me (1,200 articles deletion of {{BFI}}). If you see any problems they need manual adjustment. I don't think the number of problems is very large from spot checking. -- GreenC 18:35, 19 October 2023 (UTC)Reply
I think you are correct, based on my perusal of the list of Linter errors. – Jonesey95 (talk) 18:54, 19 October 2023 (UTC)Reply

Flagging non-dead link as dead (2)

Hello. Why did GreenC bot rewrite url-status=live to url-status=dead in Special:Diff/1186567077 for a live URL? The URL [1] is alive, at least from Japan as of 2023-11-24 04:50 UTC (checked with Firefox and Chrome on Windows 10). Wotheina (talk) 05:05, 24 November 2023 (UTC)Reply

It's freemium content. Open an incognito window and see if it gives a different result. I tried to archive premium content pages for NatGeo because they use a freemium wall. View page source and search on "freemiumContentGatingEnabled". -- GreenC 05:42, 24 November 2023 (UTC)Reply
I see. I agree on switching from paywalls to archives, but for such unintuitive edits please write the intention somewhere, as in edit summary or embedded comment, or at least in User:GreenC/WaybackMedic 2.5. I think url-access= is the best way, but I guess you are not using that because there is no option "url-access=freemium" yet. Wotheina (talk) 06:46, 24 November 2023 (UTC)Reply
|url-access=freemium is a great idea. Until it appears, I think |url-access=live is less bad, or for a bonus point |url-access=live<!--freemium--> which can be converted in bulk later. I can see the goats too, but I block a lot of third-party scripts which might hide them in standard browsing. Certes (talk) 16:25, 8 December 2023 (UTC)Reply
Regarding "|url-access=live is less bad", did you mean "|url-status=live is less bad"? Wotheina (talk) 17:24, 8 December 2023 (UTC)Reply
Yes, sorry, I was confusing the two parameters. |url-status=live seems more accurate than |url-status=dead here. The least bad value for access might be |url-access=limited. I can't find a definition of limited to determine whether freemium falls within its scope. Certes (talk) 18:34, 8 December 2023 (UTC)Reply
When I did NatGeo, I didn't have the ability to add archive URLs with |url-status=live so unfortunately they were all set to dead. I have since added this ability after it was requested at Wikipedia:Link_rot/URL_change_requests#vh1.com by User:Alexis Jazz. I'm not sure about going back and resetting from dead to live the NatGeo links that are freemium, that would probably require some special one-off code and a lot of time to recheck all the links. But it's the kind of thing anyone could probably do pretty easily, if you have code to parse and edit CS1 templates. -- GreenC 17:34, 8 December 2023 (UTC)Reply

Backlinks timing

Hi there! I noticed that the Backlinks report hasn't run yet today for Certes or me. Looking at the bot's contributions, I see the report is running later each day this week. Could you please check the bots to see what's going on? Thank you! GoingBatty (talk) 15:22, 8 December 2023 (UTC)Reply

I started monitoring Buenos Aires as an experiment, not because its new links are likely to be wrong but because socks of a certain puppetmaster love linking to it. I've just removed it from my list, in case this widely-linked page is causing problems. Certes (talk) 16:18, 8 December 2023 (UTC)Reply
They are forks of the same script that run on different cron jobs and in different directories, so it should not be possible for them to affect each other. If both are not working, I dunno; I'll check. -- GreenC 17:40, 8 December 2023 (UTC)Reply

GoingBatty & Certes, I found a bug that only shows up when running from cron. It wasn't apparent when the script was on Toolforge because there you specify the working directory with -wd= on the jsub command, which masked the problem. The effect of the bug was to create duplicate entries in the list at /Backlinks, which is why it kept taking longer each run. For example, GoingBatty had 7 instances of "hamlet" (from the script's perspective): one for the original and 6 for each day the script ran. So I think the best solution is to wipe out the data files again and start over; the data files look kind of weird anyway. As usual, you'll see the message about new entries, then the next one should be good. -- GreenC 18:20, 8 December 2023 (UTC)Reply

On December 8, the bot started over and published a report, but didn't publish a report for December 9. Could you please check it again? Thanks! GoingBatty (talk) 04:34, 10 December 2023 (UTC)Reply

GoingBatty, I don't know what happened. Nevertheless, it is working now. It looks system-level. Cron logs show the process ran, but it didn't. No apparent reason, and I can't replicate. Weird. Let me know if it doesn't run again, I enabled verbose logging. Also during testing I moved the job time to around 5:30 GMT .. or do you want the previous 8:30? Or some other time? -- GreenC 06:01, 10 December 2023 (UTC)Reply

Thank you! I'd prefer the previous 8:30, as I'm likely to see the 5:30 job right before I should be going to sleep, and then be tempted to stay up too late to address them immediately. Thanks! GoingBatty (talk) 07:04, 10 December 2023 (UTC)Reply

User:Certes during testing your most recent report lost some data, seen below. -- GreenC 06:01, 10 December 2023 (UTC)Reply

Thanks; I'll take a look at those. I've a slight preference for 0830 over 0530, as I tend to look at the entries about 1000-1200 UTC and the fresher the better. Certes (talk) 16:07, 10 December 2023 (UTC)Reply

It didn't run again. The logging helped. I'm narrowing in on the problem and made some changes. We'll see what happens next run. -- GreenC 21:18, 11 December 2023 (UTC)Reply

At some point when this issue is resolved, are you willing to open Backlinks to other users? For example, see Wikipedia:Help desk#Notification for Links to Pages by Other Users. Thanks! GoingBatty (talk) 04:18, 12 December 2023 (UTC)Reply

So, it does appear my IP is being rate limited by WMF. I moved all my tools off-site and it's generating a lot of traffic. The solution is to add a retry loop with pauses. Will try that next. -- GreenC 14:42, 12 December 2023 (UTC)Reply

Would moving the tools on-site be a solution? I know they just made that a whole lot more difficult by deprecating GridEngine. Certes (talk) 14:49, 12 December 2023 (UTC)Reply
That will take time because I think it will require building a custom Kubernetes image, which is a learning curve. I have a ticket open asking them about this but no reply yet. I should have been using a retry loop anyway, so this will help either way. I have a function, but was apparently lazy and didn't call it. -- GreenC 15:26, 12 December 2023 (UTC)Reply
A lot of people will be climbing the same learning curve. It would be nice if we had a page for giving each other a leg up. Sadly (or perhaps gratefully), I've never had to use Kubernetes and so can't be of much assistance. Certes (talk) 16:17, 12 December 2023 (UTC)Reply
I hope to learn the system eventually, probably good thing to know. -- GreenC 18:02, 12 December 2023 (UTC)Reply

Ran both manually with the new code. It will keep requesting when it gets a 429 ("Too many requests"). It tries 20 times with a 2 second delay. I have seen it make up to 5 requests, but it will depend on WMF server load. The jobs will run on the regular morning schedule tomorrow. -- GreenC 18:02, 12 December 2023 (UTC)Reply

If it's not too much work, escalating the delay might be good for both the program and the server, e.g. if the nth try fails, wait n seconds. (Exponential is recommended but seems extreme.) Certes (talk) 18:15, 12 December 2023 (UTC)Reply
There are too many tools making constant requests; it almost doesn't matter, they are going to saturate regardless. I'm concerned because if it is slowed down too much the work never gets done. Will keep on it. It will email if/when it reaches 20. -- GreenC 19:49, 12 December 2023 (UTC)Reply
Hmmm. It sounds as if they need a bigger computer. They can afford it. Certes (talk) 22:33, 12 December 2023 (UTC)Reply
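A minimal sketch of the retry loop discussed in this thread, not the bot's actual code. The 20 tries, the 2-second delay, and the "wait n seconds after the nth failure" variant come from the comments above; the function name fetch_with_retry, the escalate flag, and the request callable are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_retry(
    request: Callable[[], Tuple[int, str]],
    max_tries: int = 20,
    base_delay: float = 2.0,
    escalate: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call request() until it stops returning HTTP 429 ("Too many requests").

    request is a hypothetical callable returning (status_code, body).
    Returns the body of the first non-429 response, or None if every
    attempt was rate limited (the point at which the bot would send an email).
    """
    for attempt in range(1, max_tries + 1):
        status, body = request()
        if status != 429:
            return body
        if attempt < max_tries:
            # Fixed delay by default; with escalate=True the nth failure waits n seconds.
            sleep(float(attempt) if escalate else base_delay)
    return None
```

With the defaults, the worst case spends 19 × 2 = 38 seconds waiting before giving up; with escalate=True it spends 1 + 2 + ... + 19 = 190 seconds, which illustrates the trade-off between server load and the work getting done.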

Everything looks good today. Thank you. The only difference from before is that the output now appears alphabetically by target rather than sorted as in the parent page, but that's not a problem. Certes (talk) 10:13, 13 December 2023 (UTC)Reply

Because there were duplicates in the parent page I had to unique the list, which required a sort. I tried to unique it in a way that doesn't require a sort, i.e. cat file.txt | awk '!s[$0]++' > out.txt, but for some reason it dropped one of the entries. I didn't have time to investigate, so I went with the tried-and-true method of sort file.txt | uniq > out.txt. You can try this yourself with the list of entries and see if the results differ in the number of entries on output compared to input. -- GreenC 15:43, 13 December 2023 (UTC)Reply
That sounds very reasonable. (sort -u may work on your system too.) Certes (talk) 16:33, 13 December 2023 (UTC)Reply
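For comparison, a small Python sketch of the two de-duplication approaches mentioned above: one that keeps the original order (the intent of the awk one-liner) and one that sorts (roughly what sort -u does, though Python's default string ordering may differ from a locale-aware sort). The function names and sample entries are made up for illustration.

```python
from typing import Iterable, List


def unique_in_order(lines: Iterable[str]) -> List[str]:
    """Drop repeated lines, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(lines))


def unique_sorted(lines: Iterable[str]) -> List[str]:
    """Drop repeated lines and return the survivors in sorted order."""
    return sorted(set(lines))


entries = ["hamlet", "Buenos Aires", "hamlet", "Zimbabwe", "Buenos Aires"]
print(unique_in_order(entries))  # order of first appearance
print(unique_sorted(entries))    # code-point order
```

Both functions should return the same number of entries for the same input, so comparing the two lengths is a quick way to check whether one method is dropping something. Lines that differ only in trailing whitespace count as distinct in both.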

Buck Goldstein

Hi there! In this edit, your bot changed an incorrect |url= parameter, which added the article to Category:CS1 errors: URL. Should the bot have done something different, or should it ignore the |url= parameter and only update the |archiveurl=/|archive-url= parameter? Thanks! GoingBatty (talk) 06:02, 18 December 2023 (UTC)Reply

You mean Special:Diff/1187499427/1190066019. The bot that runs this process is a global bot. It is not programmed to handle templates in different languages; it only operates on the URL itself, not with template knowledge. The bot didn't do anything wrong that wasn't already there; its only purpose is to normalize archive.today URLs wherever they happen to be. If that caused the pre-existing error to be exposed in the tracking cat, it's a step forward. -- GreenC 06:32, 18 December 2023 (UTC)Reply

Preserving the correct archived version of archive.today links

In this edit, WaybackMedic 2.5 attempted to reformat a link to archive.today that had multiple different archives, but used the archive of the wrong date. The pre-existing link https://archive.is/2Ljk6 is an archive from 24 November 2023. The link should have been converted to http://archive.today/2023.11.24-014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s (the "long link" for the page), but was instead converted to https://archive.today/20231124014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s , which corresponds to the 6 December 2023 archive. This resulted in the new archive link leading to an archive of a 404 page instead of the successfully archived page, and the archive-date parameter not matching the timestamp on the page or in the long URL.

Ideally, the bot would notice when the new URL's archive date does not match the old URL's archive date and not make the edit if it cannot resolve this. Also, ideally it would catch when the citation template's archive-date doesn't match the URL's archive date, and either adjust the template's archive-date or display some kind of warning. SnorlaxMonster 12:09, 1 January 2024 (UTC)Reply

Actually, there also appears to be an issue on archive.today's end. While the page https://archive.md/2Ljk6 does have a share option that says that http://archive.today/2023.11.24-014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s is the correct long URL, as it turns out, that long URL redirects to the 404 archive as well. In cases like that, I think WaybackMedic 2.5 should not change the URL to the long version, until archive.today corrects their long URLs for URLs with multiple archives. --SnorlaxMonster 12:12, 1 January 2024 (UTC)Reply
That's strange. Looks like a one-off error at archive.today; I've never seen it before. I can't verify that every new long archive.today URL resolves to the same page, because the resource load on archive.today servers would double, as would the time it takes the bot to finish. That might be justified if there were evidence of a widespread problem, but in 7 years and over half a million conversions this is the first time it's been reported. All I can do for now is add a static string to the code to skip processing when it sees 2Ljk6. Other tools, like IABot or possibly Citation Bot, might try to do the same conversion. This is a tricky problem to solve long term. Ideally, the correct solution is for archive.today to be notified. -- GreenC 19:23, 1 January 2024 (UTC)Reply
I notified archive.today about the specific issue with the long URL via their "report bug or abuse" button, but I have no idea how likely those reports are to get read. I think just manually excluding that specific case is the best option for now.
With regard to validating that the target page is the same, I think it should be as simple as checking that the timestamp is the same (ignoring the bug I mentioned in my second message, where the long URL can redirect to the wrong version). I assume whatever API you're using to get the long URL from the short URL returns the archive date of the short URL in the request you are already making. The long URL has the archive date in the URL itself, so it seems to me it should be possible to validate that the archive date hasn't changed by just comparing those two values, without needing any additional API requests to archive.today. But I also don't know what code your bot uses, so I can't verify my assumptions about how it works. (I tried taking a look at the GitHub page linked on User:GreenC/WaybackMedic 2.5, but it appears that it is for Wayback Medic 2.1 and doesn't include the fixarchiveis function that's included in Wayback Medic 2.5.) --SnorlaxMonster 13:22, 2 January 2024 (UTC)Reply
There is no API for this. You download the HTML of the short URL page, and the long form is there towards the top (view source search on "long link"). The GitHub code is old, but you can see it here at line 173. If the long form URL goes to a different version of the HTML, as in this case, I would need to download both the short and long HTML page, and run a string comparison to see if they are approximately the same HTML. Thus downloading HTML twice. -- GreenC 22:28, 2 January 2024 (UTC)Reply
Ah okay, I suspected it could just be plain web scraping. Anyway, what I was trying to suggest was just comparing the date in the URL with the date on the HTML page (so there would be no need to resolve the long link). However, I had missed that the date in the long URL you retrieved was the correct one; the issue was entirely that archive.today redirects it. --SnorlaxMonster 23:34, 2 January 2024 (UTC)
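
The "approximately the same HTML" check that GreenC describes above could be sketched roughly as follows. This is only an illustration using Python's standard difflib module, not the bot's actual code; the function name approximately_same_html and the 0.9 threshold are assumptions made for the example, and fetching the two pages is left out.

```python
from difflib import SequenceMatcher


def approximately_same_html(short_html: str, long_html: str, threshold: float = 0.9) -> bool:
    """Return True when two HTML documents look like the same snapshot.

    The threshold is an illustrative assumption, not a value taken from the bot.
    """
    # autojunk=False stops difflib from ignoring very frequent characters,
    # which would otherwise distort the ratio on long inputs.
    ratio = SequenceMatcher(None, short_html, long_html, autojunk=False).ratio()
    return ratio >= threshold
```

A check like this still needs both pages downloaded, which is the doubled load on archive.today mentioned above, and SequenceMatcher can be slow on large pages.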

bug report

At this edit, GreenC bot copied a malformed Wayback Machine URL from |url= into |archive-url=. It ought not to have done it like that.

The Wayback Machine URL is malformed because its timestamp is not an acceptable length (14 digits preferred, 4 or 6 tolerated). cs1|2 emits one error message for single-digit timestamps and another when the values assigned to |url= and |archive-url= are the same.

Trappist the monk (talk) 01:46, 30 January 2024 (UTC)

Also, it is not clear where |archive-date=2007-06-15 came from.
Trappist the monk (talk) 01:49, 30 January 2024 (UTC)
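
The timestamp-length rule described in this report (14 digits preferred, 4 or 6 tolerated) can be sketched as a small pre-check. This is a hedged illustration only: the function name classify_wayback_timestamp, the regular expression, and the handling of a two-letter modifier such as id_ after the digits are assumptions for the example, not the logic used by cs1|2 or by the bot.

```python
import re
from typing import Optional

# Captures the run of digits that follows "/web/" in a Wayback Machine URL.
# An optional two-letter modifier such as "id_" may follow the digits.
_WAYBACK_TS = re.compile(
    r"^https?://web\.archive\.org/web/(\d+)(?:[a-z]{2}_)?/", re.IGNORECASE
)


def classify_wayback_timestamp(url: str) -> Optional[str]:
    """Classify the timestamp length of a Wayback Machine URL.

    Returns "preferred" (14 digits), "tolerated" (4 or 6 digits),
    "malformed" (any other digit count), or None when the URL does not
    look like a timestamped Wayback Machine URL at all.
    """
    match = _WAYBACK_TS.match(url)
    if match is None:
        return None
    length = len(match.group(1))
    if length == 14:
        return "preferred"
    if length in (4, 6):
        return "tolerated"
    return "malformed"
```

A caller could decline to copy a |url= value into |archive-url= whenever the result is "malformed" or None.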

Bug report: Incorrect archive-date

Hi there! In this edit, the bot added |archive-date=18990101080101. Is there something you could add to the bot to prevent the addition of incorrect dates such as this? Thanks! GoingBatty (talk) 18:22, 30 January 2024 (UTC)

I do have warnings but apparently was lazy and forgot to check the logs. -- GreenC 20:08, 30 January 2024 (UTC)
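
One possible guard against values like the one reported above is a plausibility check before writing |archive-date=. The sketch below is an assumption-laden illustration, not the bot's warning logic: the name plausible_archive_date is hypothetical, it accepts only YYYY-MM-DD values, and the 1996 lower bound is an assumed cutoff chosen because web archives do not reach back further.

```python
from datetime import date, datetime
from typing import Optional

# Assumed lower bound for a believable archive date.
EARLIEST_PLAUSIBLE = date(1996, 1, 1)


def plausible_archive_date(value: str, today: Optional[date] = None) -> bool:
    """Return True if value is a YYYY-MM-DD date between 1996 and today."""
    today = today or date.today()
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        # Not a date in the expected format, e.g. a raw 14-digit timestamp.
        return False
    return EARLIEST_PLAUSIBLE <= parsed <= today
```

With a check of this kind, a raw timestamp such as 18990101080101 is rejected because it is not in date format, and 1899-01-01 is rejected because it falls before the assumed cutoff.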

bug report (2)

Category:CS1 errors: archive-url recently bloomed. I have just fixed these four articles broken by Wayback Medic 2.5:

Every error was an |archive-date= mismatch with the |archive-url= timestamp. |archive-date= was always off by one day, and always earlier than the timestamp, except for this one from 2024 Noto earthquake.

Trappist the monk (talk) 18:57, 1 February 2024 (UTC)

And then there is this one that is off by a couple of weeks, and this one off by a year. So it looks like what I wrote above may not hold much water...
Trappist the monk (talk) 19:08, 1 February 2024 (UTC) 19:37, 1 February 2024 (UTC)

The date mismatch error preexisted. The bot only made it more obvious, so that CS1|2 error-checking is now able to see it. I would prefer to fix the archive-date at the same time as expanding archive.today URLs from short to long form (per RfC requirement). However, this task is universal: it operates on many language wikis and has no knowledge of template names or arguments in other languages. It only expands a URL wherever it may be; it doesn't look at templates. Fixing the dates would require another universal bot, I guess, one that can operate on CS1|2 templates in multiple languages. If you want to write one, I have the approval to run it. The dates are frequently offset by 1 day because users add an archive.today link they just created and set |archive-date= to the date at their own location, but archive.today uses UTC, which has already passed into a new day. The ones offset by a week or a year are user entry errors. -- GreenC 21:49, 1 February 2024 (UTC)
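
The one-day offset explained above can be reproduced with a short example. This is only a sketch of the date arithmetic; the function name utc_archive_date and the UTC-5 editor are hypothetical, chosen to show a local evening that is already the next day in UTC.

```python
from datetime import datetime, timedelta, timezone


def utc_archive_date(local_time: datetime) -> str:
    """Return the UTC calendar date (YYYY-MM-DD) for a timezone-aware local time."""
    if local_time.tzinfo is None:
        raise ValueError("local_time must be timezone-aware")
    return local_time.astimezone(timezone.utc).date().isoformat()


# Hypothetical editor at UTC-5 who archives a page late in the evening.
eastern = timezone(timedelta(hours=-5))
local = datetime(2024, 1, 31, 20, 30, tzinfo=eastern)

print(local.date().isoformat())  # 2024-01-31, the date the editor would type
print(utc_archive_date(local))   # 2024-02-01, the date in the archive timestamp
```

In this case the editor's |archive-date= is one day earlier than the archive timestamp, which matches the pattern reported in this thread.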