This tutorial is based on the research of Joy Buolamwini and Timnit Gebru (read more at gendershades.org), as well as that of Kimmo Karkkainen and Jungseock Joo, the authors of the FairFace dataset used below.
Accuracy is one of the standard measures of performance in classification: it tells us the fraction of times that the model is correct. Most commonly it's reported as a single number, but that approach can hide a number of problems.
For example, let's assume we have a model with a reported accuracy of 90%. That might sound great, but what if the accuracy varies a lot between different groups of users? Would we still be happy with that model if that 90% were distributed unequally among genders, with 99% accuracy for male users, 85% for female users, and 5% for users of other genders? What if accuracy is not evenly distributed among users of different races?
When a model like this is used, for example, in university admissions, hiring, or healthcare, its errors will have an outsized negative influence on the opportunities of non-male or non-white users.
Further, testing for one attribute at a time might hide inequities that only become apparent when we look at the intersections of user attributes. For example, 90% accuracy for female users as a group might be composed of 100% accuracy for white female users and only 80% accuracy for Black female users. Again, a model like this used in a life-opportunity context might further exacerbate societal inequities, particularly for individuals at certain intersections of attributes.
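To make this concrete, here is a minimal sketch with made-up group sizes and accuracies (not real measurements) showing how a single overall number is just a size-weighted average of subgroup accuracies, and can therefore hide a 20-point gap:
# Hypothetical example: two subgroups with (number of images, accuracy within the group)
groups = {'white female': (500, 1.00),
          'Black female': (500, 0.80)}
total_correct = sum(n * acc for n, acc in groups.values())
total_images = sum(n for n, _ in groups.values())
# The overall number looks fine even though one group lags 20 points behind
print(f"overall accuracy: {total_correct / total_images:.0%}")
for name, (n, acc) in groups.items():
    print(f"{name}: {acc:.0%}")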
In this tutorial we will perform an intersectional audit of Face++, an online API for extracting information from faces. In particular, we will measure how well it can estimate a binary gender from pictures of individuals of different genders, races, and ages.
A binary view of gender does not capture the complexities of gender identity and presentation, but the commercial products only offer binary classification and the available benchmark datasets only feature male/female labels. We will use those here for a straightforward measure of the model's performance.
To that end:
The benchmark used for the original study is only available by request, so let's use the FairFace dataset, which is more readily available from https://github.com/joojs/fairface.
Make sure you download the dataset with padding=1.25, and that you also get the validation labels.
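The code below assumes that the validation images live in a local folder called dataset/ and that the label file fairface_label_val.csv sits next to this notebook; if your layout differs, adjust the paths accordingly. A quick sanity check along these lines can catch a mismatch early:
# Quick sanity check of the assumed layout (adjust the paths to match your download):
#   dataset/                 the FairFace validation images, e.g. dataset/51.jpg
#   fairface_label_val.csv   the validation labels
import os
assert os.path.isdir('dataset'), "expected the validation images in a dataset/ folder"
assert os.path.isfile('fairface_label_val.csv'), "expected the label csv next to this notebook"
print(f"found {len(os.listdir('dataset'))} images")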
# Fill in your Face++ API credentials here
API_KEY = ''
API_SECRET = ''
(You might need to pip install requests if you haven't already!)
import requests
import json
import base64
BASE_URL = 'https://api-us.faceplusplus.com/facepp/v3/detect'
def detect(image_path):
    """
    This function uses HTTP POST to query the Face++ API
    that detects and classifies faces in the given image
    Args:
        image_path: path to the image that we want to classify
    Returns:
        result: a dictionary containing the API response
    """
    with open(image_path, 'rb') as image_file:
        # we will send the contents of the image encoded in base64,
        # see https://en.wikipedia.org/wiki/Base64
        image_base64 = base64.b64encode(image_file.read())
    data = {'api_key': API_KEY,
            'api_secret': API_SECRET,
            'image_base64': image_base64,
            'return_attributes': 'gender,age'}  # play around with the webpage to
                                                # find out other possible attributes
    api_response = requests.post(url=BASE_URL, data=data)
    return json.loads(api_response.content)
Let's try it with the image called 51.jpg:
result = detect('dataset/51.jpg')
result
print(f"{result['faces'][0]['attributes']['gender']['value']}, {result['faces'][0]['attributes']['age']['value']}")
Great, that worked. Let's now query the API for each picture in the dataset.
Because we're using the free API, we're limited to one query per second, so we will have to sleep
between calls.
The more pictures you classify, the more accurate measurement you will get but feel free to stop the loop after a few hundred (or after ~10 minutes) by pressing the square Stop button.
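One caveat: the loop below indexes result['faces'] directly, which assumes every API call succeeds. If you run into failed requests (for example after hitting the rate limit), a small defensive wrapper like this sketch (not part of the original code; the name detect_safely is made up) can keep the loop going:
def detect_safely(image_path):
    """Like detect(), but returns an empty face list if the API call fails."""
    try:
        result = detect(image_path)
    except Exception:
        # network hiccup or a response that isn't valid JSON
        return []
    # an error response may not contain a 'faces' key at all
    return result.get('faces', [])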
import os
import time
import pandas as pd
# The tqdm library gives us a progress bar visualization.
# You might need to pip install tqdm if you haven't already.
from tqdm import tqdm
results = pd.DataFrame({'file': [], 'est_gender': [], 'est_age': []})
for fname in tqdm(sorted(os.listdir('dataset'))):
    if fname in results['file'].values:
        # if we already classified this file let's not do it again
        continue
    result = detect(f'dataset/{fname}')
    if len(result['faces']) == 1:
        # if there's exactly one face, save the result
        gender = result['faces'][0]['attributes']['gender']['value']
        age = result['faces'][0]['attributes']['age']['value']
        results.loc[len(results.index)] = {'file': fname, 'est_gender': gender, 'est_age': age}
    # wait one second so that we don't exceed the API quota
    time.sleep(1)
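If you plan to stop the loop and come back to it in a later session, it may be worth persisting the partial results to disk (the file name below is just a suggestion), so that the "already classified" check at the top of the loop can skip the images you have already done:
# Save the partial results so a later session can pick up where we left off
results.to_csv('face_api_results.csv', index=False)
# ...and in a fresh session, load them back before re-running the loop above:
# results = pd.read_csv('face_api_results.csv')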
Now that Face++ has classified our images, we need to measure how often its estimates were correct. The first step will be to combine the two dataframes.
# read the ground truth dataset
labels = pd.read_csv('fairface_label_val.csv')
labels.head()
# let's make sure that the filenames match between our two data frames
labels['file'] = labels['file'].map(lambda x: x.split('/')[1])
labels.head()
# now we can combine them; the inner join keeps only the images
# for which we got exactly one detected face
combined = labels.set_index('file').join(results.set_index('file'), how='inner')
combined.head()
Remember, accuracy is the fraction of answers that are correct, i.e. the count of correct answers divided by the total count of answers.
We're going to be comparing the model's guess, est_gender, to the ground truth, gender.
def accuracy(estimate, truth):
    """
    Calculates accuracy by comparing the estimate to the truth
    """
    return (estimate == truth).sum()/len(estimate)
Let's start by measuring the overall performance of the model.
acc = accuracy(combined['est_gender'], combined['gender'])
acc
That's 86.5% accuracy.
Sounds good, let's now test how well it performs for men and women separately.
for gender in combined['gender'].unique():
    # pick only the rows that correspond to images of people of one gender
    subset = combined.loc[combined['gender']==gender]
    acc = accuracy(subset['est_gender'], subset['gender'])
    print(f"{gender}, {acc*100:.1f}% accuracy")
We see here that the model is slightly less accurate for women than for men, i.e. it more often misgenders women.
Let's now measure the accuracy for each race separately.
for race in combined['race'].unique():
    # pick only the rows that correspond to images of people of one race
    subset = combined.loc[combined['race']==race]
    acc = accuracy(subset['est_gender'], subset['gender'])
    print(f"{race}, {acc*100:.1f}% accuracy")
Here we observe bigger differences than with gender. Most notably, the performance for Black individuals is more than 10 percentage points worse than for the groups with the highest accuracy.
Finally, let's measure the accuracy for groups at different intersections of race and gender.
for race in combined['race'].unique():
    for gender in combined['gender'].unique():
        # pick only the rows that correspond to images of people
        # of one intersection of race and gender
        subset = combined.loc[(combined['race']==race) & (combined['gender']==gender)]
        acc = accuracy(subset['est_gender'], subset['gender'])
        print(f"{gender}, {race}, {acc*100:.1f}% accuracy")
When inspecting intersections, we see that the between-group differences in accuracy are even larger than when looking at gender and race separately.
Even though the overall difference in accuracy between genders is only 2 percentage points, for Indian individuals the gap is much larger.
Overall the biggest gap in this test is between Black male individuals (78.1% accuracy) and Middle Eastern male individuals (91.4%).
Note that these results are not exactly the same as those found by the original audit, because we used a different benchmark dataset.
Still, we observe large differences in performance between different groups of individuals.