This tutorial is based on the research of Joy Buolamwini and Timnit Gebru (read more at gendershades.org), as well as that of Kimmo Karkkainen and Jungseock Joo, the authors of the FairFace dataset used below.
Accuracy is one of the standard measures of performance in classification: it tells us the fraction of times that the model is correct. Most commonly it's reported as a single number, but that approach can hide a number of problems.
For example, let's assume we have a model with a reported accuracy of 90%. That might sound great, but what if the accuracy varies a lot between different groups of users? Would we still be happy with that model if that 90% were distributed unequally among genders, with 99% accuracy for male users, 85% for female users, and 5% for users of other genders? What if accuracy is not evenly distributed among users of different races?
When a model like this is used, for example, in university admissions, hiring, or healthcare, its errors will have an outsized negative influence on the opportunities of non-male or non-white users.
Further, testing for one attribute at a time might hide inequities that only become apparent when we look at the intersections of user attributes. For example, 90% accuracy for female users as a group might be composed of 100% accuracy for white female users and only 80% accuracy for Black female users. Again, a model like this used in a life-opportunity context might further exacerbate societal inequities, particularly for individuals at certain intersections of attributes.
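To make this concrete, here is a minimal sketch with made-up group sizes and accuracies (not real measurements) showing how a single overall number is just a size-weighted average of subgroup accuracies, and can therefore hide a 20-point gap:
# Hypothetical example: two subgroups with (number of images, accuracy within the group)
groups = {'white female': (500, 1.00),
          'Black female': (500, 0.80)}
total_correct = sum(n * acc for n, acc in groups.values())
total_images = sum(n for n, _ in groups.values())
# The overall number looks fine even though one group lags 20 points behind
print(f"overall accuracy: {total_correct / total_images:.0%}")
for name, (n, acc) in groups.items():
    print(f"{name}: {acc:.0%}")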
In this tutorial we will perform an intersectional audit of Face++, an online API for extracting information from faces. In particular, we will measure how well it can estimate a binary gender from pictures of individuals of different genders, races, and ages.
A binary view of gender does not capture the complexities of gender identity and presentation, but the commercial products only offer binary classification and the available benchmark datasets only feature male/female labels. We will use those here for a straightforward measure of the model's performance.
To that end:
The benchmark used for the original study is only available by request, so let's use the FairFace dataset, which is more readily available from https://github.com/joojs/fairface.
Make sure you download the dataset with padding=1.25, and that you also get the validation labels.
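The code below assumes that the validation images live in a local folder called dataset/ and that the label file fairface_label_val.csv sits next to this notebook; if your layout differs, adjust the paths accordingly. A quick sanity check along these lines can catch a mismatch early:
# Quick sanity check of the assumed layout (adjust the paths to match your download):
#   dataset/                 the FairFace validation images, e.g. dataset/51.jpg
#   fairface_label_val.csv   the validation labels
import os
assert os.path.isdir('dataset'), "expected the validation images in a dataset/ folder"
assert os.path.isfile('fairface_label_val.csv'), "expected the label csv next to this notebook"
print(f"found {len(os.listdir('dataset'))} images")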
# Fill in your Face++ API credentials here
API_KEY = ''
API_SECRET = ''
(You might need to pip install requests if you haven't already!)
import requests
import json
import base64
BASE_URL = 'https://api-us.faceplusplus.com/facepp/v3/detect'
def detect(image_path):
    """
    This function uses HTTP POST to query the Face++ API
    that detects and classifies faces in the given image
    Args:
        image_path: path to the image that we want to classify
    Returns:
        result: a dictionary containing the API response
    """
    with open(image_path, 'rb') as image_file:
        # we will send the contents of the image encoded in base64,
        # see https://en.wikipedia.org/wiki/Base64
        image_base64 = base64.b64encode(image_file.read())
    data = {'api_key': API_KEY,
            'api_secret': API_SECRET,
            'image_base64': image_base64,
            'return_attributes': 'gender,age'}  # play around with the webpage to
                                                # find out other possible attributes
    api_response = requests.post(url=BASE_URL, data=data)
    return json.loads(api_response.content)
Let's try it with the image called 51.jpg:
result = detect('dataset/51.jpg')
result
print(f"{result['faces'][0]['attributes']['gender']['value']}, {result['faces'][0]['attributes']['age']['value']}")
Great, that worked. Let's now query the API for each picture in the dataset.
Because we're using the free API, we're limited to one query per second, so we will have to sleep
between calls.
The more pictures you classify, the more accurate measurement you will get but feel free to stop the loop after a few hundred (or after ~10 minutes) by pressing the square Stop button.
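One caveat: the loop below indexes result['faces'] directly, which assumes every API call succeeds. If you run into failed requests (for example after hitting the rate limit), a small defensive wrapper like this sketch (not part of the original code; the name detect_safely is made up) can keep the loop going:
def detect_safely(image_path):
    """Like detect(), but returns an empty face list if the API call fails."""
    try:
        result = detect(image_path)
    except Exception:
        # network hiccup or a response that isn't valid JSON
        return []
    # an error response may not contain a 'faces' key at all
    return result.get('faces', [])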
import os
import time
import pandas as pd
# The tqdm library gives us a progress bar visualization.
# You might need to pip install tqdm if you haven't already.
from tqdm import tqdm
results = pd.DataFrame({'file': [], 'est_gender': [], 'est_age': []})
for fname in tqdm(sorted(os.listdir('dataset'))):
    if fname in results['file'].values:
        # if we already classified this file let's not do it again
        continue
    result = detect(f'dataset/{fname}')
    if len(result['faces']) == 1:
        # if there's exactly one face, save the result
        gender = result['faces'][0]['attributes']['gender']['value']
        age = result['faces'][0]['attributes']['age']['value']
        results.loc[len(results.index)] = {'file': fname, 'est_gender': gender, 'est_age': age}
    # wait one second so that we don't exceed the API quota
    time.sleep(1)
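If you plan to stop the loop and come back to it in a later session, it may be worth persisting the partial results to disk (the file name below is just a suggestion), so that the "already classified" check at the top of the loop can skip the images you have already done:
# Save the partial results so a later session can pick up where we left off
results.to_csv('face_api_results.csv', index=False)
# ...and in a fresh session, load them back before re-running the loop above:
# results = pd.read_csv('face_api_results.csv')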
Now that Face++ has classified our images, we need to measure how often its estimates were correct. The first step will be to combine the two dataframes.
# read the ground truth dataset
labels = pd.read_csv('fairface_label_val.csv')
labels.head()
# let's make sure that the filenames match between our two data frames
labels['file'] = labels['file'].map(lambda x: x.split('/')[1])
labels.head()
# now we can combine them; the inner join keeps only the images
# for which we got exactly one detected face
combined = labels.set_index('file').join(results.set_index('file'), how='inner')
combined.head()
Remember, accuracy is the fraction of answers that are correct, i.e. the count of correct answers divided by the total count of answers.
We're going to be comparing the model's guess, est_gender, to the ground truth, gender.
def accuracy(estimate, truth):
    """
    Calculates accuracy by comparing the estimate to the truth
    """
    return (estimate == truth).sum()/len(estimate)
Let's start by measuring the overall performance of the model.
acc = accuracy(combined['est_gender'], combined['gender'])
acc
That's 86.5% accuracy.
Sounds good, let's now test how well it performs for men and women separately.
for gender in combined['gender'].unique():
    # pick only the rows that correspond to images of people of one gender
    subset = combined.loc[combined['gender']==gender]
    acc = accuracy(subset['est_gender'], subset['gender'])
    print(f"{gender}, {acc*100:.1f}% accuracy")
We see here that the model is slightly less accurate for women than for men, i.e. it more often misgenders women.
Let's now measure the accuracy for each race separately.
for race in combined['race'].unique():
    # pick only the rows that correspond to images of people of one race
    subset = combined.loc[combined['race']==race]
    acc = accuracy(subset['est_gender'], subset['gender'])
    print(f"{race}, {acc*100:.1f}% accuracy")
Here we observe bigger differences than with gender. Most notably, the performance for Black individuals is more than 10 percentage points worse than for the groups with the highest accuracy.
Finally, let's measure the accuracy for groups at different intersections of race and gender.
for race in combined['race'].unique():
    for gender in combined['gender'].unique():
        # pick only the rows that correspond to images of people
        # of one intersection of race and gender
        subset = combined.loc[(combined['race']==race) & (combined['gender']==gender)]
        acc = accuracy(subset['est_gender'], subset['gender'])
        print(f"{gender}, {race}, {acc*100:.1f}% accuracy")
When inspecting intersections, we see that the between-group differences in accuracy are even larger than when looking at gender and race separately.
Even though the overall difference in accuracy between genders is only 2 percentage points, for Indian individuals the gap is much larger.
Overall the biggest gap in this test is between Black male individuals (78.1% accuracy) and Middle Eastern male individuals (91.4%).
Note that these results are not exactly the same as those found by the original audit, because we used a different benchmark dataset.
Still, we observe large differences in performance between different groups of individuals.