
Identifying controversial content in Wikidata
Closed, Resolved · Public

Description

The aim of this project is to identify controversial content in Wikidata.

Specifically we will develop the following tasks:

  • Create and test different definitions of controversiality in Wikidata,
  • Develop a model to identify controversial content early.

Event Timeline

diego triaged this task as High priority. Aug 3 2021, 9:44 AM

As a very initial exploration, we analyzed a subset of Wikidata items, categorized them by topic, and checked which of them received more updates, as a proxy for controversiality.

More specifically,

  • We selected all the Wikidata items with sitelinks to enwiki.
  • We counted the number of edit summaries containing the keyword wbsetclaim-update.
  • We found that claims related to Software and computing are, proportionally, the most frequently updated within this subset.

image.png (431×622 px, 126 KB)

  • We also found that the most frequently updated property is P31.

image.png (407×164 px, 23 KB)

These last results are not normalized yet.
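The counting step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the real analysis ran over the MediaWiki history data, and the sample rows and item IDs below are hypothetical.

```python
from collections import Counter

# Hypothetical sample of (item_id, edit_summary) rows; in the real analysis
# these come from the edit history of items with sitelinks to enwiki.
edits = [
    ("Q42", "/* wbsetclaim-update:2||1 */ Property:P31"),
    ("Q42", "/* wbsetclaim-create:2||1 */ Property:P21"),
    ("Q64", "/* wbsetclaim-update:2||1 */ Property:P31"),
    ("Q64", "/* wbsetclaim-update:2||1 */ Property:P646"),
]

KEYWORD = "wbsetclaim-update"

# Count claim updates per item, using the edit-summary keyword as the signal.
updates_per_item = Counter(
    item for item, summary in edits if KEYWORD in summary
)
```

Aggregating these per-item counts by topic then gives the proportional comparison shown in the figure.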

@Lydia_Pintscher, regarding your question about the number of users co-editing a Wikidata page: for all edits to namespace 0 in July 2021, considering items that have at least one sitelink, I found that:

  • 84% of pages were edited just by one user.
  • 14% by two users, and the remaining 2% of pages by more than two users.
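The per-page editor counts behind these percentages can be computed as below. The edit rows and usernames are made up for illustration; the real numbers come from all namespace-0 edits in July 2021.

```python
from collections import defaultdict

# Hypothetical (page, user) edit rows for one month.
edits = [
    ("Q1", "alice"), ("Q1", "alice"),
    ("Q2", "alice"), ("Q2", "bob"),
    ("Q3", "carol"),
    ("Q4", "dave"), ("Q4", "erin"), ("Q4", "frank"),
]

# Distinct editors per page.
editors = defaultdict(set)
for page, user in edits:
    editors[page].add(user)

# Share of pages edited by exactly one, exactly two, and more than two users.
n = len(editors)
counts = [len(users) for users in editors.values()]
one = sum(c == 1 for c in counts) / n
two = sum(c == 2 for c in counts) / n
more = sum(c > 2 for c in counts) / n
```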

Updates

  • I'm focusing on reverted revisions.
  • Developed a methodology to characterize Wikidata edits according to different dimensions, such as the property edited, the edit type (from edit summaries), and user characteristics.

(popular edit types)

image.png (298×383 px, 39 KB)

  • Exploring the differences in reverts done/received by bots and humans.
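The edit-type dimension mentioned above comes from the structured prefix of MediaWiki edit summaries. A minimal extraction sketch, assuming summaries of the usual `/* action:... */` shape (the helper name is mine, not from the project code):

```python
import re

def edit_action(summary: str):
    """Extract the edit-type action (e.g. 'wbsetclaim-update') from a
    MediaWiki edit summary of the form '/* action:... */ comment'.

    Returns None when the summary has no structured prefix.
    """
    m = re.match(r"/\*\s*([\w-]+)", summary)
    return m.group(1) if m else None
```

Grouping edits by this action string yields the popular-edit-types distribution in the figure below.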

Updates

  • I've been running analyses on the predictability of reverts on Wikidata, including page, user, and edit characteristics such as the property and the action summary explained above.
  • Probably not surprisingly, I've found that user characteristics such as "account age" (the time between a given edit and the user's account creation) are the most strongly related to revert probability:
  • I've also noted that bots are less likely than humans to be reverted, and that edits in new articles (items) are more likely to be reverted.
  • I'm now analyzing a set of properties that showed some correlation with reverts:
    • P9157
    • P97
    • P3602
    • P646
    • P3782
    • P183
    • P2860
    • P7902
    • P9339
    • P2671
  • Next steps are to model interactions between users, and also to analyze the usage of the "disputed by" qualifiers.

image.png (267×410 px, 10 KB)
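The "account age" feature described above is a simple timestamp difference. A sketch, assuming ISO 8601 timestamps as found in the MediaWiki history data (the function name is mine):

```python
from datetime import datetime

def account_age_days(edit_ts: str, registration_ts: str) -> float:
    """Age of the editing account at the time of the edit, in days.

    This is the feature described above: the difference between an edit's
    timestamp and the user's account-creation timestamp.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    edit = datetime.strptime(edit_ts, fmt)
    created = datetime.strptime(registration_ts, fmt)
    return (edit - created).total_seconds() / 86400
```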

Updates

No updates this week.

Updates

  • I've been crunching data to study the "disputed by" qualifier. The plan is to have some statistics on this and compare them with the revert behavior.

Updates

I've created a page on meta about this project. In the following weeks I'll be uploading some of the analysis and main results there.

Updates

  • I've started gathering and organizing the different results, to write a first report.

Updates

We presented this work at the TTO'21 conference. We received interesting feedback, including questions about the definition of controversial content. Some potential collaborations for a second round of this research were opened.

Updates

  • Preliminary results presented to our stakeholder.
  • In the next weeks we will focus on a deeper understanding of reverting behavior.

TODO

  • Update meta page (within the next 3 weeks)

Updates

No updates this week.

Updates

  • Working on modeling the reverting behavior.

Updates

  • I've been working on a classifier to predict reverts.
    • The current classifier uses article (item), revision, and user information.
    • On a balanced test set, the current model achieves over 70% accuracy.
    • However, there is a set of caveats to be considered:
      • 'auto-reverts': users can revert themselves; this shouldn't be considered a signal of controversy. We need to analyze this behavior further.
      • power-users: we need to take into account that a small set of users produces most of the edits and reverts, which could affect our results. We are working on different sampling methods to address this issue.
  • The meta page was updated with the results in Q1 and partial results in Q2.
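The two caveats above translate into pre-processing steps. A sketch under stated assumptions: the record fields, usernames, and 50/50 user split are hypothetical, and the real work explores several sampling methods.

```python
import random

# Hypothetical edit records: (user, reverted flag, who reverted it).
edits = [
    {"user": "alice", "reverted": True,  "reverted_by": "alice"},  # auto-revert
    {"user": "alice", "reverted": True,  "reverted_by": "bot1"},
    {"user": "bob",   "reverted": False, "reverted_by": None},
    {"user": "carol", "reverted": True,  "reverted_by": "dave"},
]

# Caveat 1: drop auto-reverts, since a user undoing their own edit is not
# a signal of controversy.
edits = [e for e in edits
         if not (e["reverted"] and e["reverted_by"] == e["user"])]

# Caveat 2: split train/test by user rather than by edit, so a power user's
# edits never appear on both sides and inflate the measured accuracy.
users = sorted({e["user"] for e in edits})
random.seed(0)
random.shuffle(users)
cut = int(len(users) * 0.5)
train_users = set(users[:cut])
train = [e for e in edits if e["user"] in train_users]
test = [e for e in edits if e["user"] not in train_users]
```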

Updates

  • No updates this week. I'm going to meet with the stakeholder next week.

Updates

  • We have seen that few items are edited by more than one user.
  • We are currently researching the item and user characteristics related to collaborative work.

Updates

  • I'm focusing on modeling the relationship between topics and collaborations/controversies.
    • I'm working on a graph representation of these components.

Updates

  • I'm organizing the new results to be discussed with the stakeholder.

Updates

  • We are now focusing on understanding collaboration patterns: when/how more than one user edits the same item in a given period of time.
    • We found that in Wikidata such collaborations are less frequent than in other Wikimedia projects.
    • We also found that items edited by more than one user are usually related to ongoing events (awards, deaths, releases).
  • I'll present some of these findings:
    • At a research meeting (Tuesday) in March
    • And @Lydia_Pintscher will propose a date probably in April to present these results to the Wikidata folks.
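The collaboration definition above can be sketched as "an item edited by more than one user within a time window". A minimal illustration with a hypothetical function name, sample rows, and window size:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def coedited_items(edits, window_days=7):
    """Items edited by more than one user within `window_days` of each other.

    `edits` is a list of (item, user, iso_timestamp) rows. This is a
    simplified sketch of the collaboration definition, not the project code.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    by_item = defaultdict(list)
    for item, user, ts in edits:
        by_item[item].append((datetime.strptime(ts, fmt), user))

    window = timedelta(days=window_days)
    result = set()
    for item, rows in by_item.items():
        rows.sort()  # chronological order
        # Check consecutive edits: different users within the window.
        for (t1, u1), (t2, u2) in zip(rows, rows[1:]):
            if u1 != u2 and t2 - t1 <= window:
                result.add(item)
    return result

# Tiny sample: Q1 is co-edited within the window, Q2's edits are a month
# apart, and Q3 is edited by a single user.
sample = [
    ("Q1", "a", "2021-07-01T00:00:00Z"), ("Q1", "b", "2021-07-03T00:00:00Z"),
    ("Q2", "a", "2021-07-01T00:00:00Z"), ("Q2", "b", "2021-08-01T00:00:00Z"),
    ("Q3", "a", "2021-07-01T00:00:00Z"), ("Q3", "a", "2021-07-02T00:00:00Z"),
]
coedited = coedited_items(sample)
```

Varying `window_days` is what the later update refers to when checking that the results hold "even when we change the time window".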

Updates

  • I'm working on identifying collaborative edits on Wikidata items not related to current events.

Updates

  • We met with Lydia and discussed the current results.
  • We reviewed the results, confirming that most co-edited items correspond to ongoing events, even when we change the time window considered.
  • Now, I'll be studying the relevance/prevalence of anonymous edits on popular content.

Updates

  • I've presented the main results of this work during the Tuesday Research Sessions; slides can be found here.

Updates

  • I was comparing the results when adding anonymous edits; so far I haven't found major differences with the previous results. I'll continue working on this during the next week, before my next meeting with Lydia.
diego updated the task description.

Updates

  • We finished this project; the results can be found on Meta, and the code and models can be found on GitLab.
  • I'll discuss future work with @Lydia_Pintscher.