Photo Algorithms ID White Men Fine—Black Women, Not So Much

Study finds that facial-analysis services from Microsoft and IBM are significantly more accurate for men than women and for whites than blacks, adding to concerns about bias in artificial intelligence.

Facial recognition is becoming more pervasive in consumer products and law enforcement, backed by increasingly powerful machine-learning technology. But a test of commercial facial-analysis services from IBM and Microsoft raises concerns that the systems scrutinizing our features are significantly less accurate for people with black skin.

Researchers tested features of Microsoft and IBM’s face-analysis services that are supposed to identify the gender of people in photos. The companies’ algorithms proved near perfect at identifying the gender of men with lighter skin, but frequently erred when analyzing images of women with dark skin.

The skewed accuracy appears to be due to underrepresentation of darker skin tones in the training data used to create the face-analysis algorithms.

The disparity is the latest example in a growing collection of bloopers from AI systems that seem to have picked up societal biases around certain groups. Google’s photo-organizing service still censors the search terms “gorilla” and “monkey” after an incident nearly three years ago in which algorithms tagged black people as gorillas, for example. The question of how to ensure that machine-learning systems deployed in consumer products, commercial systems, and government programs are fair has become a major topic of discussion in the field of AI.

A 2016 report from Georgetown described wide, largely unregulated deployment of facial recognition by the FBI, as well as local and state police forces, and evidence that the systems in use were less accurate for African-Americans.

In the new study, researchers Joy Buolamwini of MIT’s Media Lab and Timnit Gebru, a Stanford grad student currently working as a researcher at Microsoft, fed the face-analysis systems 1,270 photos of parliamentarians from Europe and Africa. The photos were chosen to represent a broad spectrum of human skin tones, using a classification system from dermatology called the Fitzpatrick scale. The research will be presented at the FAT* conference on fairness, accountability, and transparency in algorithmic systems later this month.

The image collection was used to test commercial cloud services that look for faces in photos from Microsoft, IBM, and Face++, a division of Beijing-based startup Megvii. The researchers’ analysis focused on the gender detection feature of the three services.
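For a sense of how such an audit is wired up in practice, here is a minimal sketch (not the researchers’ own code) of sending one benchmark photo to a cloud gender-detection endpoint and recording the label that comes back. The endpoint URL, query parameter, and response fields follow the general shape of Microsoft’s Face API around the time of the study; treat them, along with the placeholder key and file path, as assumptions.

```python
import requests

# Assumed endpoint and key placeholders; the URL, header, and response layout
# mirror the general shape of Microsoft's Face API circa 2018 and may differ
# from what the services expose today.
ENDPOINT = "https://westus.api.cognitive.microsoft.com/face/v1.0/detect"
API_KEY = "YOUR_SUBSCRIPTION_KEY"  # hypothetical placeholder


def classify_gender(image_path):
    """Send one photo to the cloud service and return the gender label it reports."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = requests.post(
        ENDPOINT,
        params={"returnFaceAttributes": "gender"},
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    response.raise_for_status()
    faces = response.json()  # one entry per face detected in the photo
    return faces[0]["faceAttributes"]["gender"] if faces else None


# An audit would loop this over all 1,270 benchmark photos and store each
# prediction alongside the subject's labeled gender and skin-tone group.
```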

All three services worked better on male faces than female faces, and on lighter faces than darker faces. All the companies’ services had particular trouble recognizing that photos of women with darker skin tones were in fact women.

When asked to analyze the lightest male faces in the image set, Microsoft’s service correctly identified them as men every time. IBM’s algorithms had an error rate of 0.3 percent.

When asked to analyze darker female faces, Microsoft’s service had an error rate of 21 percent. IBM and Megvii’s Face++ both had 35 percent error rates.
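Those headline figures are simply disaggregated error rates: the share of photos in each skin-tone-and-gender subgroup that a service labels incorrectly. A short sketch of that calculation, using made-up predictions rather than the study’s data, might look like this:

```python
import pandas as pd

# Hypothetical audit results: one row per benchmark photo, with the
# ground-truth gender, a binned skin-tone group, and the label a service
# returned. These values are illustrative, not the study's data.
df = pd.DataFrame({
    "true_gender": ["female", "female", "male", "male"],
    "skin_group":  ["darker", "darker", "lighter", "lighter"],
    "predicted":   ["male",   "female", "male",    "male"],
})

# Error rate for each intersectional subgroup (e.g. darker-skinned women),
# mirroring the breakdown reported in the study.
df["error"] = df["true_gender"] != df["predicted"]
report = (
    df.groupby(["skin_group", "true_gender"])["error"]
      .mean()
      .mul(100)
      .round(1)
      .rename("error_rate_pct")
)
print(report)
```

Grouping by skin tone and gender together is what surfaces the gap; a single error rate averaged over the whole benchmark would hide it.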

In a statement, Microsoft said it had taken steps to improve the accuracy of its facial-recognition technology, and was investing in improving its training datasets. “We believe the fairness of AI technologies is a critical issue for the industry and one that Microsoft takes very seriously,” the statement said. The company declined to answer questions about whether its face-analysis service had previously been tested for performance on different skin tone groups.

An IBM spokesperson said the company will deploy a new version of its service later this month. The company incorporated the audit’s findings into a planned upgrade effort, and created its own dataset to test accuracy on different skin tones. An IBM white paper says tests using that new dataset found the improved gender-detection service has an error rate of 3.5 percent on darker female faces. That’s still worse than the 0.3 percent for lighter male faces, but one-tenth the error rate in the study. Megvii did not respond to a request for comment.

Services that offer machine-learning algorithms on demand have become a hot area of competition among large technology companies. Microsoft, IBM, Google, and Amazon pitch cloud services for tasks like parsing the meaning of images or text as a way for industries such as sports, healthcare, and manufacturing to tap artificial intelligence capabilities previously limited to tech companies. The flip side is that customers also buy into the limitations of those services, which may not be apparent.

One customer of Microsoft’s AI services, the startup Pivothead, is working on smart glasses for visually impaired people. The glasses use Microsoft’s cloud vision services to have a synthetic voice describe the age and facial expression of people nearby.

A video for the project, made in collaboration with Microsoft, shows the glasses helping a man understand what’s around him as he walks down a London street with a white cane. At one point the glasses say “I think it’s a man jumping in the air doing a trick on a skateboard” when a young white man zips past. The audit of Microsoft’s vision services suggests such pronouncements could be less accurate if the rider had been black.

Technical documentation for Microsoft’s service says that gender detection and the other attributes it reports for faces, such as emotion and age, are “still experimental and may not be very accurate.”

DJ Patil, chief data scientist for the United States under President Obama, says the study’s findings highlight the need for tech companies to ensure their machine-learning systems work equally well for all types of people. He suggests purveyors of services like those tested should be more open about the limitations of the services they offer under the shiny banner of artificial intelligence. “Companies can slap on a label of machine learning or artificial intelligence, but you have no way to say what are the boundaries of how well this works,” he says. “We need that transparency of this is where it works, this is where it doesn’t.”

Buolamwini and Gebru’s paper argues that only disclosing a suite of accuracy numbers for different groups of people can truly give users a sense of the capabilities of image processing software used to scrutinize people. IBM’s forthcoming white paper on the changes being made to its face analysis service will include such information.

The researchers who forced that response also hope to enable others to perform their own audits of machine-learning systems. The collection of images they used to test the cloud services will be made available for other researchers to use.

Microsoft has made efforts to position itself as a leader in thinking about the ethics of machine learning. The company has many researchers working on the topic, and an internal ethics panel called Aether, for AI and Ethics in Engineering and Research. In 2017 it was involved in an audit that discovered Microsoft’s cloud service that analyzes facial expressions functioned poorly on children under a certain age. Investigation revealed shortcomings in the data used to train the algorithms, and the service was fixed.
