
[Epic] Fix and improve Mr.Z's popular pages report
Closed, Resolved (Public)

Description

Mr.Z's popular pages tool appears to be down and not coming back up: https://en.wikipedia.org/wiki/User_talk:Mr.Z-man#Popular_pages. Many apparently miss it. Any chance that Community Tech can look into this?

Existing code at https://github.com/alexz-enwp/popularpages
Existing interface at http://tools.wmflabs.org/popularpages/
Sample report: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spiders/Popular_pages
See also Category:Lists of popular pages by WikiProject
Community Wishlist request (#9)

New GitHub repo: https://github.com/wikimedia/popularpages
Development web interface: http://tools.wmflabs.org/popularpages-dev/

Related Objects

Status  Subtype  Assigned  Task
Resolved TBolliger
Resolved Niharika
Resolved kaldari
Resolved Niharika
Resolved Niharika
Resolved jcrespo
Resolved kaldari
Resolved Niharika
Resolved Niharika
Resolved Niharika
Resolved kaldari
Resolved Johan
Resolved Niharika
Resolved Niharika
Resolved Niharika
Resolved Niharika

Event Timeline


I believe this tool ran off of the pageviews dumps and not the API, and using the latter won't really be feasible. I have been wanting to make a tool or two using the dumps, so maybe we could revive this in the process. The big question I think is where to store the data, as depending on how far back we want to go it can get quite sizable and I don't think hogging disk space on Tool Labs is best.

I hadn't used this tool before. Can someone give a brief summary of what exactly it did?

I found the code at https://github.com/alexz-enwp/popularpages/tree/master/poppages but no documentation. :(

I think it showed the top-viewed pages within a given WikiProject, is that correct @kaldari ? If we can get some sort of internal API set up that uses the pageviews dumps, we can revive this tool and also fulfill some requests for total monthly pageviews of all pages in a given WikiProject, specifically WikiProject Medicine.

> I think it showed the top-viewed pages within a given WikiProject, is that correct @kaldari ? If we can get some sort of internal API set up that uses the pageviews dumps, we can revive this tool and also fulfill some requests for total monthly pageviews of all pages in a given WikiProject, specifically WikiProject Medicine.

Hmm, if we only need a month worth of data, can't we use the Pageview API like Pageviews tool does?

> Hmm, if we only need a month worth of data, can't we use the Pageview API like Pageviews tool does?

Not for every page in a WikiProject. For Medicine that's something like 12,000 pages * 30 days, so 360,000 individual requests to the Pageviews API. This sort of thing would have to be generated monthly from the dumps, as it would easily overload the API. It turns out West.andrew.g is already doing this for WP:Medicine: https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_medical_pages. I think he only made that for Medicine per request, so other WikiProjects aren't supported, and I believe it also runs off the old pageviews dumps and is hence inaccurate compared to the new metrics available for download.
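To make the scale concrete: the Wikimedia Pageview API's per-article endpoint is real, but the helper functions below are only an illustrative sketch (not the bot's actual code). Note that one per-article call can return a whole month of daily counts, yet it is still one HTTP request per page, so a large WikiProject means thousands of calls either way.

```python
from urllib.parse import quote

# Real Pageview API per-article endpoint; helper names below are illustrative.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(title, start, end, project="en.wikipedia.org"):
    """Build the URL for daily pageviews of one article over a date range
    (YYYYMMDD start/end, as the API expects)."""
    return (f"{BASE}/{project}/all-access/user/"
            f"{quote(title, safe='')}/daily/{start}/{end}")

def request_count(num_pages):
    """Even batching a month of daily counts into one call per article,
    a WikiProject with num_pages pages still needs num_pages requests."""
    return num_pages

print(per_article_url("Asthma", "20170101", "20170131"))
print(request_count(12000))  # ~12,000 calls just for WikiProject Medicine
```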

Seems a bit crazy to run this for every WikiProject, but I guess that's what Mr.Z-bot was doing https://en.wikipedia.org/wiki/Wikipedia:Lists_of_popular_pages_by_WikiProject

My understanding is that there was a script that would parse the logs once a month and populate a dedicated table in the database. Then it would use that table to generate stats pages for each WikiProject.

The bot would update WikiProject-specific pages like Wikipedia:WikiProject Video games/Popular pages with monthly stats (rank, views, views per day, assessment, importance). See also Category:Lists of popular pages by WikiProject

See T141010: Looks like the Analytics team is possibly adding top pages on a per-WikiProject basis to the core pageviews API. If we are sure this is to happen in Q2 (as it is currently tagged), we shouldn't put considerable effort into reviving Mr.Z-bot for this purpose as writing a new bot from scratch that uses the API should be trivial, I think.

This should be proposed as a task for the upcoming Community Wishlist Survey.

The wishlist proposal related to this issue is listed at #9 out of 265 proposals (technically, tied at #8).

I don't know much about MediaWiki bots.
But maybe we could integrate all this into the analytics pageview pipeline instead of rewriting the bot.
It would *probably* be more scalable and real time.
And we could cross the resulting data set with other dimensions that we already have like country and referrer type, if they are useful.
I guess the main effort on the analytics side would be to collect the information about which pages belong to which wikiprojects.
If the dumps is the only source for this data though, that could be a complication.

> I guess the main effort on the analytics side would be to collect the information about which pages belong to which wikiprojects.

@mforns: That data is easily retrieved from the PageAssessments API (or the page_assessments table in the database). For example, all pages that belong to WikiProject Medicine:
https://en.wikipedia.org/w/api.php?action=query&list=projectpages&wppprojects=Medicine
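A sketch of how that lookup could work in code, assuming the `list=projectpages` query module from the PageAssessments extension (the URL parameters match the API call above; the helper names and the offline sample response are illustrative):

```python
from urllib.parse import urlencode

def projectpages_url(project, limit=500):
    """Build the MediaWiki API URL listing pages in a WikiProject
    via the PageAssessments list=projectpages module."""
    params = {"action": "query", "list": "projectpages",
              "wppprojects": project, "wpplimit": limit, "format": "json"}
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

def titles_from_response(resp, project):
    """Extract page titles from one parsed JSON response
    (pages are grouped under query.projects.<project name>)."""
    return [p["title"] for p in resp.get("query", {})
                                    .get("projects", {})
                                    .get(project, [])]

# Offline sample of the response shape, for illustration only:
sample = {"query": {"projects": {"Medicine": [
    {"pageid": 1, "ns": 0, "title": "Asthma"},
    {"pageid": 2, "ns": 0, "title": "Aspirin"}]}}}
print(titles_from_response(sample, "Medicine"))  # ['Asthma', 'Aspirin']
```

A real client would fetch each URL (e.g. with urllib.request) and follow the API's `continue` parameters until all pages are returned.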

@mforns Real-time data isn't a concern, as these are monthly-generated pages; the only time-sensitive step is producing them. It's certainly better for performance that readers load flat pages than run a real-time data grab every time. Why push the servers when we don't have to? If people want to run specific queries on views, they can use Massviews.

@kaldari
Awesome, then I think that could be implemented by importing (sqooping) the page_assessments table into Hadoop and then having a Hadoop job that generates this data. If necessary, we could put it in the Pageview API, although we'd need to know if there's enough interest in this data.

@Stevietheman
Sorry for abusing the term real-time data, my bad; I meant more granular data, like hourly or daily data. If there's no interest in, e.g., daily data, then there's no need to work on this, agreed. But mainly because of the task workload, not the server load, since the Pageview API uses Varnish as a caching layer, so the majority of requests never reach the server and are served from the cache.

kaldari renamed this task from Mr.Z's popular pages is down and abandoned to [Epic] Fix and improve Mr.Z's popular pages report.Jan 26 2017, 2:39 AM
kaldari updated the task description. (Show Details)

http://tools.wmflabs.org/popularpages-dev/ is currently throwing me a 404 — is this known? And I imagine the URL will change when dev is complete and it's live?

> http://tools.wmflabs.org/popularpages-dev/ is currently throwing me a 404 — is this known? And I imagine the URL will change when dev is complete and it's live?

No. The tool is a backend-only bot; it doesn't have a web version. It processes data, generates the reports and then posts them to wiki pages. See the "Report" column links on https://en.wikipedia.org/wiki/User:Community_Tech_bot/Popular_pages for the reports that have been generated.

On the project page, it is written "The bot currently only posts on English-language Wikipedia."

I don't think the scope of Community Tech is English Wikipedia only. When will this service be available for all wikis?

> On the project page, it is written "The bot currently only posts on English-language Wikipedia."
>
> I don't think the scope of Community Tech is English Wikipedia only. When will this service be available for all wikis?

Hi @Trizek-WMF. You're right that the scope of our team is not just English Wikipedia. The wishlist wish was to fix and improve Mr.Z-bot's popular pages report, and that bot was only functional on English Wikipedia. This made sense at the time the bot was created, because the idea is to generate reports for pages under WikiProjects. The complication with adding this feature for other wikis is that there is no standardized definition of how WikiProjects are created, used, and evaluated, which makes it hard to fetch the class and importance ratings we display in the reports. In many cases it is also hard to correctly associate pages with WikiProjects, because of complexities with the sub-projects, taskforces, and equivalents that other wikis have come up with.

We are definitely thinking of enabling the bot on other wikis, but it would need some amount of work to understand their WikiProject landscape and to enable the PageAssessments extension on those wikis, so we can fetch which pages belong to which WikiProjects and what their assessment status is. We have already done that for a couple of wikis, and we can start running the bot on those soon, once we are assured the bot is stable and capable of handling very large WikiProjects.

Thank you for your reply, @Niharika.

19 wikis are using assessments. Most of them have copied the assessment system from English Wikipedia, which is the reference. There are minor differences between the wikis, but the WikiProject+letter structure is common to all of them.

Contacting those communities to introduce how this new system works, and what they can change to get it, would not be wasted effort IMO, especially if maintenance of the bot is guaranteed. :)