Replicating Propbulica’s COMPAS Audit in TD STEM Academy 2021¶

Why COMPAS?¶

Propublica started the COMPAS Debate with the article Machine Bias. With their article, they also released details of their methodology and their data and code. This presents a real data set that can be used for research on how data is used in a criminal justice setting without researchers having to perform their own requests for information, so it has been used and reused a lot of times.

First, we need to import some common libraries,

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from sklearn.metrics import roc_curve
import warnings
warnings.filterwarnings('ignore')

Propublica COMPAS Data¶

The dataset consists of COMPAS scores assigned to defendants over two years 2013-2014 in Broward County, Florida, it was released by Propublica in a GitHub Repository. These scores are determined by a proprietary algorithm designed to evaluate a persons recidivism risk - the likelihood that they will reoffend. Risk scoring algorithms are widely used by judges to inform their sentencing and bail decisions in the criminal justice system in the United States. The original ProPublica analysis identified a number of fairness concerns around the use of COMPAS scores, including that ‘’black defendants were nearly twice as likely to be misclassified as higher risk compared to their white counterparts.’’ Please see the full article for further details. Use pandas to read in the data and set the id column to the index.

df_pp = pd.read_csv("https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv",
                 header=0).set_index('id')

Look at the list of columns and the first few rows to get an idea of what the dataset looks like.

print(list(df_pp))
df_pp.head()

['name', 'first', 'last', 'compas_screening_date', 'sex', 'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number', 'r_charge_degree', 'r_days_from_arrest', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid', 'is_violent_recid', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'type_of_assessment', 'decile_score.1', 'score_text', 'screening_date', 'v_type_of_assessment', 'v_decile_score', 'v_score_text', 'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1', 'start', 'end', 'event', 'two_year_recid']

	name	first	last	compas_screening_date	sex	dob	age	age_cat	race	juv_fel_count	...	v_decile_score	v_score_text	v_screening_date	in_custody	out_custody	priors_count.1	start	end	event	two_year_recid
id
1	miguel hernandez	miguel	hernandez	2013-08-14	Male	1947-04-18	69	Greater than 45	Other	0	...	1	Low	2013-08-14	2014-07-07	2014-07-14	0	0	327	0	0
3	kevon dixon	kevon	dixon	2013-01-27	Male	1982-01-22	34	25 - 45	African-American	0	...	1	Low	2013-01-27	2013-01-26	2013-02-05	0	9	159	1	1
4	ed philo	ed	philo	2013-04-14	Male	1991-05-14	24	Less than 25	African-American	0	...	3	Low	2013-04-14	2013-06-16	2013-06-16	4	0	63	0	1
5	marcu brown	marcu	brown	2013-01-13	Male	1993-01-21	23	Less than 25	African-American	0	...	6	Medium	2013-01-13	NaN	NaN	1	0	1174	0	0
6	bouthy pierrelouis	bouthy	pierrelouis	2013-03-26	Male	1973-01-22	43	25 - 45	Other	0	...	1	Low	2013-03-26	NaN	NaN	2	0	1102	0	0

5 rows × 52 columns

Data Cleaning¶

For this analysis, we will restrict ourselves to only a few features, and clean the dataset according to the methods using in the original ProPublica analysis.

For this tutorial, we’ve prepared a cleaned copy of the data, that we can import directly.

df = pd.read_csv('https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv')

Data Exploration¶

Next we provide a few ways to look at the relationships between the attributes in the dataset. Here is an explanation of these values:

age: defendant’s age
c_charge_degree: degree charged (Misdemeanor of Felony)
race: defendant’s race
age_cat: defendant’s age quantized in “less than 25”, “25-45”, or “over 45”
score_text: COMPAS score: ‘low’(1 to 5), ‘medium’ (5 to 7), and ‘high’ (8 to 10).
sex: defendant’s gender
priors_count: number of prior charges
days_b_screening_arrest: number of days between charge date and arrest where defendant was screened for compas score
decile_score: COMPAS score from 1 to 10 (low risk to high risk)
is_recid: if the defendant recidivized
two_year_recid: if the defendant within two years
c_jail_in: date defendant was imprisoned
c_jail_out: date defendant was released from jail
length_of_stay: length of jail stay

In particular, as in the ProPublica analysis, we are interested in the implications for the treatment of different groups as defined by some protected attribute. In particular we will consider race as the protected attribute in our analysis. Next we look at the number of entries for each race.

Use value_counts to look at how much data is available for each race and compare the original and clean versions

df['race'].value_counts()

African-American    3175
Caucasian           2103
Name: race, dtype: int64

COMPAS score distribution¶

Let’s look at the COMPAS score distribution between African-Americans and Caucasians (matches the one in the ProPublica article).

race_score_table = df.groupby(['race','decile_score']).size().reset_index().pivot(
                                index='decile_score',columns='race',values=0)

# percentage of defendants in each score category
(100*race_score_table/race_score_table.sum()).transpose()

decile_score	1	2	3	4	5	6	7	8	9	10
race
African-American	11.496063	10.897638	9.385827	10.614173	10.173228	10.015748	10.803150	9.480315	9.984252	7.149606
Caucasian	28.768426	15.263909	11.317166	11.554922	9.510223	7.608179	5.373276	4.564907	3.661436	2.377556

Next, make a bar plot with that table (quickest way is to use pandas plot with figsize=[12,7] to make it bigger, plot type is indicated by the kind parameter)

race_score_table.plot(kind='bar')

<AxesSubplot:xlabel='decile_score'>

As you can observe, there is a large discrepancy. Does this change when we condition on other variables?

Look at how priors are distributed. Follow what you did above for score by race (or continue for help)

priors = df.groupby(['race','priors_count']).size().reset_index().pivot(index='priors_count',columns='race',values=0)
priors.plot(kind='bar',figsize=[12,7])

<AxesSubplot:xlabel='priors_count'>

Look at how scores are distributed for those with more than two priors
(bonus) What about with less than two priors ?(you can copy or import again the above and modify it)
(bonus) Look at first time (use priors_count) felons (c_charge_degree of F) under 25. How is this different?

df_2priors = df.loc[df['priors_count']>=2]
score_2priors = df_2priors.groupby(['race','decile_score']).size().reset_index().pivot(
    index='decile_score',columns='race',values=0)
score_2priors.plot(kind='bar',figsize=[15,7])

<AxesSubplot:xlabel='decile_score'>

What happens when we take actual 2-year recidivism values into account? Are the predictions fair?¶

First, we’re going to load a different version of the data, it’s quantized. Then look at the correlation between the quantized score, the decile score and the actual recidivism.

dfQ = pd.read_csv('https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_cq.csv')

Is the ground truth correlated to the high/low rating (score_text)?

# measure with high-low score
dfQ[['two_year_recid','score_text']].corr()

	two_year_recid	score_text
two_year_recid	1.000000	0.314698
score_text	0.314698	1.000000

Is the ground truth correlated to the decile_scorerating?

dfQ[['two_year_recid','decile_score']].corr()

	two_year_recid	decile_score
two_year_recid	1.000000	0.368193
decile_score	0.368193	1.000000

The correlation is not that high. How can we evaluate whether the predictions made by the COMPAS scores are fair, especially considering that they do not predict recidivism rates well?

Fairness Metrics¶

The question of how to determine if an algorithm is fair has seen much debate recently (see this tutorial from the Conference on Fairness, Acountability, and Transparency titled 21 Fairness Definitions and Their Politics.

And in fact some of the definitions are contradictory, and have been shown to be mutually exclusive [2,3] https://www.propublica.org/article/bias-in-criminal-risk-scores-is-mathematically-inevitable-researchers-say

Here we will cover 3 notions of fairness and present ways to measure them:

Disparate Impact 4 The 80% rule
Calibration 6
Equalized Odds 5

For the rest of our analysis we will use a binary outcome - COMPAS score <= 4 is LOW RISK, >4 is HIGH RISK.

Disparate Impact¶

Disparate impact is a legal concept used to describe situations when an entity such as an employer inadvertently discriminates gainst a certain protected group. This is distinct from disparate treatment where discrimination is intentional.

To demonstrate cases of disparate impact, the Equal Opportunity Commission (EEOC) proposed “rule of thumb” is known as the The 80% rule.

Feldman et al. 4 adapted a fairness metric from this principle. For our application, it states that the percent of defendants predicted to be high risk in each protected group (in this case whites and African-Americans) should be within 80% of each other.

Let’s evaluate this standard for the COMPAS data.

#  Let's measure the disparate impact according to the EEOC rule
means_score = dfQ.groupby(['score_text','race']).size().unstack().reset_index()
means_score = means_score/means_score.sum()
means_score
# split this cell for the above to print
# compute disparate impact
AA_with_high_score = means_score.loc[1,'African-American']
C_with_high_score = means_score.loc[1,'Caucasian']

C_with_high_score/AA_with_high_score

0.5745131730114521

This ratio is below .8, so there is disparate impact by this rule. (Taking the priveleged group and the undesirable outcome instead of the disadvantaged group and the favorable outcome).

What if we apply the same rule to the true two year rearrest instead of the quantized COMPAS score?

means_2yr = dfQ.groupby(['two_year_recid','race']).size().unstack()
means_2yr = means_2yr/means_2yr.sum()
means_2yr

# compute disparte impact
AA_with_high_score = means_2yr.loc[1,'African-American']
C_with_high_score = means_2yr.loc[1,'Caucasian']
C_with_high_score/AA_with_high_score

0.7471480065031378

There is a difference in re-arrest, but not as high as assigned by the COMPAS scores. This is still a disparate impact of the actual arrests (since this not necessarily accurate as a recidivism rate, but it is true rearrest).

Now let’s measure the difference in scores when we consider both the COMPAS output and true recidivism.

Calibration¶

A discussion of using calibration to verify the fairness of a model can be found in Northpoint’s (now: Equivant) response to the ProPublica article 6.

The basic idea behind calibrating a classifier is that you want the confidence of the predictor to reflect the true outcomes. So, in a well-calibrated classifier, if 100 people are assigned 90% confidence of being in the positive class, then in reality, 90 of them should actually have had a positive label.

To use calibration as a fairness metric we compare the calibration of the classifier for each group. The smaller the difference, the more fair the calssifier.

In our problem this can be expressed as given $Y$ indicating two year recidivism, $S_Q$ indicating score (0=low, 1=high medium), and $R$ indicating race, we measure

\[ \begin{align}\begin{aligned}\mathsf{cal} \triangleq \frac{\mathbb{P}\left(Y=1\mid S_Q=s,R=\mbox{African-American} \right)}{\mathbb{P}\left(Y=1 \mid S_Q=s,R=\mbox{Caucasian} \right)},$$ for different scores $s$. Considering our quantized scores, we look at the calibration for $s=1$.\\ #### Discuss 1. Do you think this is close enough? 1. Which metric do you think is better so far?\\```{code-cell} ipython3 --- lecture_tools: block: calibration type: solution --- # compute averages dfAverage = dfQ.groupby(['race','score_text'])['two_year_recid'].mean().unstack()\\num = dfAverage.loc['African-American',1] denom = dfAverage.loc['Caucasian',1] cal = num/denom calpercent = 100*(cal-1) print('Calibration: %f' % cal) print('Calibration in percentage: %f%%' % calpercent) ```\\+++ {"lecture_tools": {"block": "calibration", "type": "interpretation"}}\\The difference looks much smaller than before. The problem of the above calibration measure is that it depends on the threshold on which we quantized the scores $S_Q$.\\In order to mitigate this, one might use a variation of this measure called *predictive parity.* In this example, we define predictive parity as\\$$\mathsf{PP}(s) \triangleq \frac{\mathbb{P}\left(Y=1\mid S\geq s,R=\mbox{African-American} \right)}{\mathbb{P}\left(Y=1 \mid S\geq s,R=\mbox{Caucasian} \right)},\end{aligned}\end{align} \]

where $S$ is the original score.

We plot $\mathsf{PP}(s) $ for $s$ from 1 to 10. Note how predictive parity depends significantly on the threshold.

# aux function for thresh score
def threshScore(x,s):
    if x>=s:
        return 1
    else:
        return 0

ppv_values = []
dfP = dfQ[['race','two_year_recid']].copy()
for s in range(1,11):
    dfP['threshScore'] = dfQ['decile_score'].apply(lambda x: threshScore(x,s))
    dfAverage = dfP.groupby(['race','threshScore'])['two_year_recid'].mean().unstack()
    num = dfAverage.loc['African-American',1]
    denom = dfAverage.loc['Caucasian',1]
    ppv_values.append(100*(num/denom-1))


plt.figure(figsize=[10,10])
plt.plot(range(1,11),ppv_values)
plt.xticks(range(1,11))
plt.xlabel('Score Threshold')
plt.ylabel('Predictive Parity (percentage)')
plt.title('Predictive parity for different thresholds\n(warning: no error bars)')

Text(0.5, 1.0, 'Predictive parity for different thresholds\n(warning: no error bars)')

Equalized Odds¶

The last fairness metric we consider is based on the difference in error rates between groups. Hardt et al. 5 propose to look at the difference in the true positive and false positive rates for each group. This aligns with the analysis performed by Propublica. We can examine these values looking at is the ROC for each group. We normalize the score between 0 and 1. The ROC thresholds produced by scikitlearn are the same.

Discuss these results and copmare how these metrics show that there is (or is not) a disparity.

# normalize decile score
max_score = dfQ['decile_score'].max()
min_score = dfQ['decile_score'].min()
dfQ['norm_score'] = (dfQ['decile_score']-min_score)/(max_score-min_score)


plt.figure(figsize=[10,10])
#plot ROC curve for African-Americans
y = dfQ.loc[dfQ['race']=='African-American',['two_year_recid','norm_score']].values
fpr1,tpr1,thresh1 = roc_curve(y_true = y[:,0],y_score=y[:,1])
plt.plot(fpr1,tpr1)

#plot ROC curve for Caucasian
y = dfQ.loc[dfQ['race']=='Caucasian',['two_year_recid','norm_score']].values
fpr2,tpr2,thresh2 = roc_curve(y_true = y[:,0],y_score=y[:,1])
plt.plot(fpr2,tpr2)
l = np.linspace(0,1,10)
plt.plot(l,l,'k--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Postitive Rate')
plt.title('ROC')
plt.legend(['African-American','Caucasian'])

<matplotlib.legend.Legend at 0x7f228d098ed0>

ML4STS Lab COMPAS Tutorial