Page MenuHomePhabricator

Wikistats data discrepancy for India page views from hive data pull
Closed, InvalidPublic

Description

We used this query to get pageview data in India by project and by access type, for the last three years:

SELECT 
  project,
  year, 
  access_method,
  SUM(view_count) as am_total
FROM wmf.pageview_hourly
  WHERE (year >= 2017)
  AND agent_type='user'
  AND country_code = 'IN'
  AND page_id != 0
  AND project IN ("bn.wikipedia", "hi.wikipedia", "ml.wikipedia", 
                  "pa.wikipedia", "ta.wikipedia", "te.wikipedia", "as.wikipedia", 
                  "sa.wikipedia", "kn.wikipedia", "tcy.wikipedia", "gu.wikipedia", 
                  "bn.wikipedia", "mr.wikipedia", "sat.wikipedia", "ur.wikipedia", 
                  "or.wikipedia", "en.wikipedia")
GROUP BY 
  project, year, access_method

The output doesn't align with data previously gathered on wikistats for the same time period.
See the wikistats query:
https://stats.wikimedia.org/v2/#/mr.wikipedia.org/reading/total-page-views/normal|bar|2-year|access~mobile-app|monthly

For example,
via the hive query in Jupyter Lab we see: 14,154 Marathi pageviews on mobile in 2019
via wikistats we see 67k Marathi pageviews on mobile in Sept 2019

We tried again with a simplified query and found a closer match though not yet matching numbers.

SELECT 
  project,
  year, 
  access_method,
  SUM(view_count) as am_total
FROM wmf.pageview_hourly
  WHERE year = 2019
  AND project = "mr.wikipedia"
GROUP BY 
  project, year, access_method

query results:
project year access_method am_total
mr.wikipedia 2019 desktop 27013186
mr.wikipedia 2019 mobile web 75567749
mr.wikipedia 2019 mobile app 509134

Event Timeline

Hi @Iflorez,
Data presented in Wikistats comes from AQS endpoints (see here for a list of pageviews related endpoints and their description).
The wikistats URL you have pasted above calls https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/mr.wikipedia.org/mobile-app/all-agents/monthly/2017100100/2019110700 to retrieve data.

I have compared data from the above to both spark and turnilo, and got exactly the same values.

In turnilo: Here is the link (I prefer not have the awefull turnilo URLs visible in messages). Beware, turnilo rounds numbers in UI, please download CSV to get precise numbers.

In spark (the query is really big - more than 110K data blocks to read):

spark.sql("SELECT CONCAT(year, '-', LPAD(month, 2, 0)) as dt, SUM(view_count) as views from wmf.pageview_hourly where ((year = 2017 and month > 9) OR (year= 2018) OR (year = 2019 AND month < 11)) AND project = 'mr.wikipedia' and access_method = 'mobile app' group by CONCAT(year, '-', LPAD(month, 2, 0)) order by dt").show(100, false)

I think that, on the first query, differences come from the fact you are filtering by country of origin. On the second simpler one, the difference probably comes from including beginning of November numbers.

I'm closing the task as invalid, please reopen if needed :)

This is helpful clarity as we work with various data systems.
I will read the AQS endpoints documentation.
Thank you!

@PDas This is the documentation for Wikistats 2: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats_2
I will also be reading through this again.

Wikistats is a limited UI over a more capable API (AQS, the analytics query service) so what wikistats offers and AQS can do are different things. Data sources are the same so I would really focus on data exploration in hive/jupyter where you have capabilities to filter by country which are not present in AQS or wikistats.

cc Product-Analytics