User:Multichill/Using OpenCV to categorize files

From Wikimedia Commons, the free media repository

At the time of writing, Commons contains about 150,000 uncategorized files. This is only about 1.25% of all files, but it is always nice to lower the number even further. A lot of categorization work has already been done by the CategorizationBot, but that work is based entirely on how a file is used. No categorization has been done based on the contents of the file itself.

OpenCV (Open Source Computer Vision) is a library of programming functions for real-time computer vision. It can be used to "recognize" images. OpenCV could be used to move uncategorized files to one of the unidentified topics categories based on the characteristics of the image. OpenCV contains several approaches we could use to "recognize" images:

Some frequently occurring subjects in uncategorized files:

I installed OpenCV as explained here:

  • I already had Python 2.7 installed
  • Installed the Python eggs of NumPy and SciPy
  • Downloaded and installed the (rather large) Windows package
  • Copied the contents of "C:\opencv\build\python\2.7" to "C:\Python27\Lib\site-packages"
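A quick way to confirm that the steps above worked is to try importing the modules from the Python interpreter. The sketch below is only an illustration: the helper name `check_modules` is hypothetical, and it assumes that the copied files make the OpenCV bindings importable under the name `cv2`.

```python
from __future__ import print_function

import importlib


def check_modules(names):
    """Map each module name to its version string.

    The value is "unknown" if the module has no __version__ attribute,
    and None if the module cannot be imported at all.
    """
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    # "cv2" is assumed to be the module name of the OpenCV Python bindings.
    report = check_modules(["numpy", "scipy", "cv2"])
    for name, version in sorted(report.items()):
        print(name, "missing" if version is None else version)
```

If `cv2` is reported as missing, the most likely cause is that the files were not copied into the `site-packages` directory of the interpreter that is actually being run.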

In the C:\opencv\samples\ directory there are two folders with example Python programs. They are fun and useful to play around with!

The first test is to use an existing, pre-trained classifier to do face detection, in combination with Pywikipedia, to fill Category:Unidentified people (bot tagged). The first results look promising. The classifier finds a lot of faces, but it also produces some false positives. The next step is probably to start training some filters based on Commons images.
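A minimal sketch of what such a face detection test could look like with OpenCV's Python bindings is shown here. It is not the code used for the test: the function names, the parameter values and the size filter are assumptions for illustration. The cascade file is expected to be one of the Haar cascades distributed with OpenCV, such as `haarcascade_frontalface_default.xml`, and its location must be passed in by the caller. The size filter is a simple heuristic that drops very small detections, which are a common source of false positives. The Pywikipedia part (tagging the file page) is left out.

```python
from __future__ import division, print_function

import sys


def keep_large_boxes(boxes, image_width, image_height, min_fraction=0.05):
    """Keep only (x, y, w, h) boxes that are reasonably large.

    A box is kept when both its width and its height are at least
    min_fraction of the shorter image side. This is an illustrative
    heuristic, not part of OpenCV.
    """
    min_side = min_fraction * min(image_width, image_height)
    return [tuple(int(v) for v in box)
            for box in boxes
            if box[2] >= min_side and box[3] >= min_side]


def detect_faces(image_path, cascade_path):
    """Return the face rectangles found in an image as (x, y, w, h) tuples."""
    import cv2  # imported here so keep_large_boxes works without OpenCV

    cascade = cv2.CascadeClassifier(cascade_path)
    if cascade.empty():
        raise IOError("could not load cascade: %s" % cascade_path)

    image = cv2.imread(image_path)
    if image is None:
        raise IOError("could not read image: %s" % image_path)

    # Haar cascades work on grayscale input; equalizing the histogram
    # reduces the influence of lighting differences.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    # A higher minNeighbors value gives fewer, more reliable detections.
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    height, width = gray.shape[:2]
    return keep_large_boxes(boxes, width, height)


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Usage: python detect_faces.py <cascade.xml> <image> [<image> ...]
    for path in sys.argv[2:]:
        faces = detect_faces(path, sys.argv[1])
        print(path, "faces:", len(faces))
```

A file would only become a candidate for Category:Unidentified people (bot tagged) when `detect_faces` returns at least one rectangle. Raising `minNeighbors` or `min_fraction` trades missed faces for fewer false positives.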

See also User:DrTrigonBot, which has similar Python code.