Thursday, November 12, 2009

Mapping the Geographies of Wikipedia Content

The Internet surrounds us like air, saturating our offices and our homes. But it’s not confined to the ether. You can touch it. You can map it. And you can photograph it - Andrew Blum 2009

The following maps represent the first stage of a project I am embarking on to map out some of the spatial contours of Wikipedia. Data were obtained from the August 2009 Wikipedia geodata dump organised by user Kolossos. The information was then ported over to a GIS. There are almost half a million geotagged Wikipedia articles (i.e. Wikipedia articles about a place or an event that occurred in a distinct place), so the preparation time alone for the files needed to create these maps was almost a week.

The map below displays the total number of Wikipedia articles tagged to each country. The country with the most articles is the United States (almost 90,000 articles). Anguilla has the fewest geotagged articles (4), and indeed most small island nations and city states have fewer than 100 articles. However, it is not just microstates that are characterised by extremely low levels of wiki representation. Almost all of Africa is poorly represented in Wikipedia. Remarkably, there are more Wikipedia articles written about Antarctica than about all but one of the fifty-three countries in Africa (or perhaps even more amazingly, there are more Wikipedia articles written about the fictional places of Middle Earth and Discworld than about many countries in Africa, the Americas and Asia).
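
For readers who want to try a similar tally, here is a minimal sketch of counting geotagged articles per country. It assumes each record has already been matched to a country code, which is not necessarily how the actual geodata dump is laid out; the function name `count_articles_by_country` and the sample rows are hypothetical and for illustration only.

```python
from collections import Counter


def count_articles_by_country(records):
    """Tally geotagged articles per country.

    `records` is an iterable of (article_title, country_code) pairs.
    This layout is an assumption, not the format of the real dump.
    """
    counts = Counter()
    for _title, country_code in records:
        counts[country_code] += 1
    return counts


# Made-up rows for illustration only (not real data).
sample_records = [
    ("Article A", "US"),
    ("Article B", "US"),
    ("Article C", "AI"),
]

counts = count_articles_by_country(sample_records)
# most_common() lists countries from most to fewest articles.
print(counts.most_common())
```

In practice the harder step is the one skipped here: assigning each pair of coordinates to a country, which is what the GIS is for.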

When the data are normalised by area, an entirely different pattern is evident. Central and Western Europe, Japan and Israel have the most articles per unit of land area, while large countries like Russia and Canada have low ratios of Wikipedia articles to area.

Finally, the data were also mapped out against population. Here countries with small populations and large landmasses rise to the top of the rankings. Canada, Australia and Greenland all have extremely high numbers of articles per 100,000 people. Smaller nations with many noteworthy features or geotaggable events also appear high in the rankings (e.g. Pitcairn or Iceland).
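
The two normalisations can be sketched in a few lines. This is only an illustration of the arithmetic, not the workflow used to produce the maps: the function name `normalise`, the choice of 1,000 square kilometres as the area unit, and the two example countries with their figures are all invented.

```python
def normalise(article_count, area_km2, population):
    """Return (articles per 1,000 km2, articles per 100,000 people).

    The 1,000 km2 area unit is an arbitrary choice for readability.
    """
    per_1000_km2 = article_count * 1000 / area_km2
    per_100k_people = article_count * 100_000 / population
    return per_1000_km2, per_100k_people


# Invented figures for two hypothetical countries (not real data).
large_sparse = normalise(article_count=9_000, area_km2=9_000_000, population=3_000_000)
small_dense = normalise(article_count=2_000, area_km2=20_000, population=8_000_000)

print("large, sparsely populated:", large_sparse)
print("small, densely populated:", small_dense)
```

With these made-up numbers the large, sparsely populated country ranks low by area but high by population, and the small, dense one does the reverse, which is the same reversal described above.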


As I've previously argued, Wikipedia is an important component of the palimpsests of place. In other words, presences and absences play a fundamental role in shaping how we interpret and interact with the world. The fact that the geographies of Wikipedia content are so uneven therefore leads to worrying conclusions. As we increasingly rely on peer-produced information, large parts of the world remain 'terra incognita' (in a similar manner to the ways in which many of those same places were represented on European maps before the 19th century). However, it is conceivable that it will only be a matter of time until the empty spaces on the Wikipedia map are filled in by Wikipedians in Zambia, Indonesia, and much of the rest of the world.

These data certainly warrant a closer look, and I'll aim to get more maps (examining the distribution of content in specific languages, and looking in more detail at specific regions) uploaded soon.

28 comments:

Anonymous said...

Interesting work, though after editing for several years I can't say I'm too surprised by the systematic bias. I look forward to seeing more.

Matt said...

This is really awesome work, Mark! I've been thinking about doing something similar with KML content...

Mark said...

Hi Matt. Thanks for the comment. I've actually also been working on mapping out placemarks over on the floatingsheep blog (http://www.floatingsheep.org/2009/06/global-placemark-intensity.html). Happy to talk about this more if you'd like.

Matt said...

Excellent. Thanks for the link, Mark... More soon.

Imaginative Display Name said...

Hi Matt
This is interesting stuff. I've just this weekend discovered earthscan Atlas series http://www.earthscan.co.uk/?tabid=37&st=basic&se=atlas

Sometime back I emailed one of the Wikinomics authors about the possibility of a three dimensional representation of Wikipedia (or any internet system, the WWW itself for that matter).
I had in mind something along the lines of star systems (or neural networks) with the most active sites as the biggest brightest nodes (or whatever you wish to call them).
Is there any such thing? If not, what kind of obstacles are we talking about? Any idea?

(I forget how I got the imaginative display name)

Nick

Mark said...

Hi Nick. Thanks for the link. Are you talking about something along the lines of these maps: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg

Imaginative Display Name said...

Yip!!

Imaginative Display Name said...

Was that image always at the top of this blog?

Anyway, that looks a lot like what I visualized, but live, active.

Imaginative Display Name said...

Oh dear... I'm all confused.. I meant to respond to Mark, not Matt (not that I have anything against Matt ;)

HaeB said...

I find the direct comparison with the density map of GeoNames entries quite visually compelling: http://en.wikipedia.org/wiki/Wikipedia:Systemic_bias#The_bias

Anonymous said...

Could you plot the data vs population density?

Mark Graham said...

Interesting idea. I'll have a look at that next week.

Kento Ikeda said...

Why does Burkina Faso have so many

nihiltres said...

I'd really like to see this data plotted against number of internet users per country. I don't see the results as a huge surprise when compared to, say, the Internet penetration map on Commons. The discrepancies with Internet penetration are more interesting.

I'd expect to see a general effect of concentration or dispersion at each extreme: a certain critical mass per country will probably result in more geotags per user, and under that mass we'll probably see very little geotagging. Above a certain mass, there will probably be a drop-off in growth, probably limited by total country area and/or population.

Adam Villani said...

It'd be nice to see insets of the countries too small to show up on the map.

Popo le Chien said...

(Redo) Backtrack here.

My understanding is that you only used geotags from the English-language Wikipedia. Is that right?

If so, it may be interesting to see whether all wikis have the same patterns of geotagging.

Mark Graham said...

@Kento: I discuss the reason for Burkina Faso having so many tags here: http://www.guardian.co.uk/technology/2009/dec/02/wikipedia-known-unknowns-geotagging-knowledge

@nihiltres: This is actually something I've been working on. Hope to have some concrete results too.

@Adam, I'll also try to get to this soon.

@Popo. No this isn't just the English Wikipedia. The data include all geotags in any language.

Popo le Chien said...

Ooops. I just realized it's written in big nice letters at the bottom of each pic :-b

mikk said...

These maps have a note "Metadata and more maps available at geospace.co.uk". Where are they exactly? I can't find them...

Franck Marchis said...

well if you compare with the map of the density of inhabitants you will see the same maxima in Europe and US. Only China is a surprise to me, but I remember that wikipedia was blocked in China for years (it may still be). It is logical that people talk and write about their environment, their country, their history since it is what they know the best.
This blog post is almost blaming wikipedia to be focused on a few countries. Since wikipedia is written by the people for the people it is logical that the maximal density area get the best coverage. The solution will be to promote higher education, and internet access to poorer countries but this is beyond the objective of the free encyclopedia. If I want to know more about a country in Africa, I will read articles available in the press or will visit my library, but I will not blame the people from the US for not writing anything about Africa in wikipedia.

Anonymous said...

You should take into account the percentage of English speaking people in the countries represented, as well. That is if you are only counting the English Wikipedia. The figures might be different if you take into account other language Wikipedias.

Mark Graham said...

@mikk: I'm afraid I haven't had the time to upload more maps or data just yet. I'll try to get to it later this week though.

@Franck: The point of this blog is not to blame Wikipedia or its editors. My point with these maps is to highlight some of the gaps in knowledge that we can: (1) work on filling in; and (2) be aware of when using Wikipedia as a resource.

@Anonymous: These maps show the results from all languages. Not just English.

Arun Viswanathan aka n30bli7z said...

Your data is interesting but was not exactly surprising to me. I think many countries in Africa or other poorer nations do not have easy access to technology or the awareness to tell their story on Wikipedia. Thus they are underrepresented in the number of articles. Also, they do not get many visitors who will be compelled to write an entry for them.

Apart from that I have 2 suggestions:
1. Can you plot the country of origin of the authors of the entries (by geoconverting their IPs) on a heat map? I suspect that this would closely resemble your first map, or maybe not?

2. Can you then plot the geographical distribution of authors for entries corresponding to a particular region? This data should be interesting. One would suspect that the authors editing articles about a region would be from that region.

Tobi F. said...

This is quite interesting, but I wonder how this compares to established reference works such as Britannica or World Book. Would they map the same?

Anonymous said...

You need think about it. Despite the emails, the overwhelming evidence showing global warming is happening hasn't changed.
"The e-mails do nothing to undermine the very strong scientific consensus . . . that tells us the Earth is warming, that warming is largely a result of human activity," Jane Lubchenco, who heads the National Oceanic and Atmospheric Administration, told a House committee. She said that the e-mails don't cover data from NOAA and NASA, whose independent climate records show dramatic warming.

Patrick John Collins said...

It's really a reflection of internet penetration - countries with cheaper / faster internet connections have more wikipedia articles.