
Identifying controversial content in Wikidata
Closed, Resolved · Public

Description

The aim of this project is to identify controversial content in Wikidata.

Specifically we will develop the following tasks:

  • Create and test different definitions of controversiality in Wikidata,
  • Develop a model to identify controversial content early.

Event Timeline

diego triaged this task as High priority. Aug 3 2021, 9:44 AM

As a very initial exploration, we analyzed a subset of Wikidata items, categorized them by topic, and checked which of them received more updates, as a proxy for controversiality.

More specifically,

  • We selected all the Wikidata items with sitelinks to enwiki.
  • We counted the number of edit summaries containing the keyword wbsetclaim-update.
  • We found that claims related to Software and computing are, proportionally, the most frequently updated within this subset.

image.png (431×622 px, 126 KB)

  • We also found that the most frequently updated property is P31.

image.png (407×164 px, 23 KB)

These last results are not normalized yet.
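The counting step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the real analysis ran over the MediaWiki history data, and the sample rows and item IDs below are hypothetical.

```python
from collections import Counter

# Hypothetical sample of (item_id, edit_summary) rows; in the real analysis
# these come from the edit history of items with sitelinks to enwiki.
edits = [
    ("Q42", "/* wbsetclaim-update:2||1 */ Property:P31"),
    ("Q42", "/* wbsetclaim-create:2||1 */ Property:P21"),
    ("Q64", "/* wbsetclaim-update:2||1 */ Property:P31"),
    ("Q64", "/* wbsetclaim-update:2||1 */ Property:P646"),
]

KEYWORD = "wbsetclaim-update"

# Count claim updates per item, using the edit-summary keyword as the signal.
updates_per_item = Counter(
    item for item, summary in edits if KEYWORD in summary
)
```

Aggregating these per-item counts by topic then gives the proportional comparison shown in the figure.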

@Lydia_Pintscher, regarding your question about the number of users co-editing a Wikidata page: for all edits to namespace 0 in July 2021, considering items that have at least one sitelink, I found that:

  • 84% of pages were edited just by one user.
  • 14% by two users, and the remaining 2% of pages by more than two users.
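The per-page editor counts behind these percentages can be computed as below. The edit rows and usernames are made up for illustration; the real numbers come from all namespace-0 edits in July 2021.

```python
from collections import defaultdict

# Hypothetical (page, user) edit rows for one month.
edits = [
    ("Q1", "alice"), ("Q1", "alice"),
    ("Q2", "alice"), ("Q2", "bob"),
    ("Q3", "carol"),
    ("Q4", "dave"), ("Q4", "erin"), ("Q4", "frank"),
]

# Distinct editors per page.
editors = defaultdict(set)
for page, user in edits:
    editors[page].add(user)

# Share of pages edited by exactly one, exactly two, and more than two users.
n = len(editors)
counts = [len(users) for users in editors.values()]
one = sum(c == 1 for c in counts) / n
two = sum(c == 2 for c in counts) / n
more = sum(c > 2 for c in counts) / n
```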

Updates

  • I'm focusing on reverted revisions.
  • Developed a methodology to characterize Wikidata edits according to different dimensions, such as the property edited, the edit type (from edit summaries), and user characteristics.

(popular edit types)

image.png (298×383 px, 39 KB)

  • Exploring the differences in reverts done/received by bots and humans.
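The edit-type dimension mentioned above comes from the structured prefix of MediaWiki edit summaries. A minimal extraction sketch, assuming summaries of the usual `/* action:... */` shape (the helper name is mine, not from the project code):

```python
import re

def edit_action(summary: str):
    """Extract the edit-type action (e.g. 'wbsetclaim-update') from a
    MediaWiki edit summary of the form '/* action:... */ comment'.

    Returns None when the summary has no structured prefix.
    """
    m = re.match(r"/\*\s*([\w-]+)", summary)
    return m.group(1) if m else None
```

Grouping edits by this action string yields the popular-edit-types distribution in the figure below.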

Updates

  • I've been running analyses on the predictability of reverts on Wikidata, including page, user, and edit characteristics such as the property and the action summary explained above.
  • Probably not surprisingly, I've found that user characteristics such as "account age" (the time between a given edit and the user's account creation) are the most strongly related to revert probability:
  • I've also noted that bots are less likely than humans to be reverted, and that edits in new articles (items) are more likely to be reverted.
  • I'm now analyzing a set of properties that showed some correlation with reverts:
    • P9157
    • P97
    • P3602
    • P646
    • P3782
    • P183
    • P2860
    • P7902
    • P9339
    • P2671
  • Next steps are to model interactions between users, and also to analyze the usage of the "disputed by" qualifiers.

image.png (267×410 px, 10 KB)
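The "account age" feature described above is a simple timestamp difference. A sketch, assuming ISO 8601 timestamps as found in the MediaWiki history data (the function name is mine):

```python
from datetime import datetime

def account_age_days(edit_ts: str, registration_ts: str) -> float:
    """Age of the editing account at the time of the edit, in days.

    This is the feature described above: the difference between an edit's
    timestamp and the user's account-creation timestamp.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    edit = datetime.strptime(edit_ts, fmt)
    created = datetime.strptime(registration_ts, fmt)
    return (edit - created).total_seconds() / 86400
```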

Updates

No updates this week.

Updates

  • I've been crunching data to study the "disputed by" qualifier. The plan is to have some statistics on this and compare them with the revert behavior.

Updates

I've created a page on meta about this project. In the following weeks I'll be uploading some of the analysis and main results there.

Updates

  • I've started gathering and organizing the different results, to write a first report.

Updates

We presented this work at the TTO'21 conference. We received interesting feedback, including questions about the definition of controversial content. Some potential collaborations for a second round of this research were opened.

Updates

  • Preliminary results presented to our stakeholder.
  • In the next weeks we will focus on a deeper understanding of reverting behavior.

TODO

  • Update meta page (within the next 3 weeks)

Updates

No updates this week.

Updates

  • Working on modeling the reverting behavior.

Updates

  • I've been working on a classifier to predict reverts.
    • The current classifier uses article (item), revision, and user information.
    • On a balanced test set, the current model achieves over 70% accuracy.
    • However, there is a set of caveats to be considered:
      • 'auto-reverts': users can revert themselves; this shouldn't be considered a signal of controversy. We need to analyze this behavior further.
      • power-users: we need to take into account that a small set of users produces most of the edits and reverts, which could affect our results. We are working on different sampling methods to address this issue.
  • The meta page was updated with the results in Q1 and partial results in Q2.
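The two caveats above translate into pre-processing steps. A sketch under stated assumptions: the record fields, usernames, and 50/50 user split are hypothetical, and the real work explores several sampling methods.

```python
import random

# Hypothetical edit records: (user, reverted flag, who reverted it).
edits = [
    {"user": "alice", "reverted": True,  "reverted_by": "alice"},  # auto-revert
    {"user": "alice", "reverted": True,  "reverted_by": "bot1"},
    {"user": "bob",   "reverted": False, "reverted_by": None},
    {"user": "carol", "reverted": True,  "reverted_by": "dave"},
]

# Caveat 1: drop auto-reverts, since a user undoing their own edit is not
# a signal of controversy.
edits = [e for e in edits
         if not (e["reverted"] and e["reverted_by"] == e["user"])]

# Caveat 2: split train/test by user rather than by edit, so a power user's
# edits never appear on both sides and inflate the measured accuracy.
users = sorted({e["user"] for e in edits})
random.seed(0)
random.shuffle(users)
cut = int(len(users) * 0.5)
train_users = set(users[:cut])
train = [e for e in edits if e["user"] in train_users]
test = [e for e in edits if e["user"] not in train_users]
```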

Updates

  • No updates this week. I'm going to meet with the stakeholder next week.

Updates

  • We have seen that few items are edited by more than one user.
  • We are currently researching the item and user characteristics related to collaborative work.

Updates

  • I'm focusing on modeling the relationship between topics and collaborations/controversies.
    • I'm working on a graph representation of these components.

Updates

  • I'm organizing the new results to be discussed with the stakeholder.

Updates

  • We are now focusing on understanding collaboration patterns: when/how more than one user edits the same item in a given period of time.
    • We found that in Wikidata such collaborations are less frequent than in other Wikimedia projects.
    • We also found that items edited by more than one user are usually related to ongoing events (awards, deaths, releases).
  • I'll present some of these findings:
    • At a research meeting (Tuesday) in March
    • And @Lydia_Pintscher will propose a date probably in April to present these results to the Wikidata folks.
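The collaboration definition above can be sketched as "an item edited by more than one user within a time window". A minimal illustration with a hypothetical function name, sample rows, and window size:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def coedited_items(edits, window_days=7):
    """Items edited by more than one user within `window_days` of each other.

    `edits` is a list of (item, user, iso_timestamp) rows. This is a
    simplified sketch of the collaboration definition, not the project code.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    by_item = defaultdict(list)
    for item, user, ts in edits:
        by_item[item].append((datetime.strptime(ts, fmt), user))

    window = timedelta(days=window_days)
    result = set()
    for item, rows in by_item.items():
        rows.sort()  # chronological order
        # Check consecutive edits: different users within the window.
        for (t1, u1), (t2, u2) in zip(rows, rows[1:]):
            if u1 != u2 and t2 - t1 <= window:
                result.add(item)
    return result

# Tiny sample: Q1 is co-edited within the window, Q2's edits are a month
# apart, and Q3 is edited by a single user.
sample = [
    ("Q1", "a", "2021-07-01T00:00:00Z"), ("Q1", "b", "2021-07-03T00:00:00Z"),
    ("Q2", "a", "2021-07-01T00:00:00Z"), ("Q2", "b", "2021-08-01T00:00:00Z"),
    ("Q3", "a", "2021-07-01T00:00:00Z"), ("Q3", "a", "2021-07-02T00:00:00Z"),
]
coedited = coedited_items(sample)
```

Varying `window_days` is what the later update refers to when checking that the results hold "even when we change the time window".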

Updates

  • I'm working on identifying collaborative edits on Wikidata items not related to current events.

Updates

  • We met with Lydia and discussed the current results.
  • We reviewed the results, confirming that most co-edited items correspond to ongoing events, even when we change the time window considered.
  • Now, I'll be studying the relevance/prevalence of anonymous edits on popular content.

Updates

  • I've presented the main results of this work during the Tuesday Research Sessions; slides can be found here.

Updates

  • I was comparing the results when adding anonymous edits; so far I haven't found major differences with the previous results. I'll continue working on this during the next week, before my next meeting with Lydia.
diego updated the task description.

Updates

  • We finished this project; the results can be found on Meta, and the code and models can be found on GitLab.
  • I'll discuss future work with @Lydia_Pintscher.