Open Source for Open Knowledge
Analyzing the Wikipedia clickstream just got easier with WikiNav
We recently developed WikiNav, an interactive tool to analyze and visualize reader navigation, as part of an Outreachy internship.
September 17, 2021
Muniza A, Isaac Johnson and Martin Gerlach
The Wikipedia image/caption matching challenge and a huge release of image data for research!
Wikipedia articles are missing images, and Wikipedia images are missing captions. A scientific competition organized by the Research team at the Wikimedia Foundation could help bridge this gap. The WMF is also releasing a large image dataset to help researchers and practitioners build systems for automatic image-text retrieval in the context of Wikipedia.
September 9, 2021
Miriam Redi, Fabian Kaelin and Tiziano Piccardi
Searching for Wikipedia
How people use search to access Wikipedia is a common question among researchers. Until now, however, there has been little data available about this relationship. To help address these questions, the Wikimedia Foundation is releasing a new, faceted dataset on search engine traffic to Wikipedia so you can ask questions like “What is the most common search engine in my country?” or “Which search engine is most used by Android users?”
June 7, 2021
Dan Andreescu, Kinneret Gordon, Isaac Johnson and Nicholas Perry
Upgrading Hadoop in just one day
The Wikimedia Analytics Engineering team manages multiple systems, all gravitating around a big (by our standards) Hadoop cluster. This post describes how we changed our Hadoop distribution in a single day, together with the lessons learned while doing it.
May 7, 2021
Luca Toscano and Joseph Allemandou
Censorship, outages and Internet shutdowns: monitoring Wikipedia’s accessibility around the world
This article describes the methodology used by the Wikimedia Foundation to monitor outages on Wikipedia around the world. These events are called anomalies and could be due to various causes, among them censorship.
January 15, 2021
Nuria Ruiz, Marcel R. Forns, Diego Saez and Sukhbir Singh
Bot or Not? Identifying “fake” traffic on Wikipedia
We have been working this past year to better identify and tag the “bot spam” traffic so we can produce top pageview lists that (mostly) do not require manual curation.
October 5, 2020
Nuria Ruiz, Joseph Allemandou, Leila Zia and MusikAnimal
MediaWiki History: the best dataset on Wikimedia content and contributors
Learn about using the MediaWiki History dataset to explore the everyday experience of editors on Wikipedia.
October 1, 2020
Marcel Forns, Joseph Allemandou, Dan Andreescu and Nuria Ruiz
Wikimedia’s Event Data Platform – Event Intake
Part 3 of 3 posts on Wikimedia’s event data platform.
September 24, 2020
Andrew Otto
Wikimedia’s Event Data Platform – JSON & Event Schemas
In the previous post, we talked about why Wikimedia chose JSONSchema instead of Avro for our Event Data Platform. This post will discuss the conventions we adopted and the tooling we built to support an Event Data Platform using JSON and JSONSchema.
September 17, 2020
Andrew Otto
Wikimedia’s Event Data Platform, or JSON is ok too
The Wikimedia Foundation has been working with event data since 2012. Over time, our event collection systems have transitioned from being used only to collect analytics data to being used to build important user-facing features. This three-part series will focus on how Wikimedia has adapted these ideas for our own unique technical environment.
September 10, 2020
Andrew Otto
Wikipedia® and other Wikimedia project names and logos are registered trademarks of the Wikimedia Foundation, a non-profit organization.
Unless otherwise stated, content is licensed under a CC BY-SA 4.0 International license.