Replicating Propbulica’s COMPAS Audit Challenge


Propublica started the COMPAS Debate with the article Machine Bias. With their article, they also released details of their methodology and their data and code. This presents a real data set that can be used for research on how data is used in a criminal justice setting without researchers having to perform their own requests for information, so it has been used and reused a lot of times.

First, we need to import some common libraries,

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve
import warnings

Propublica COMPAS Data

The dataset consists of COMPAS scores assigned to defendants over two years 2013-2014 in Broward County, Florida, it was released by Propublica in a GitHub Repository. These scores are determined by a proprietary algorithm designed to evaluate a persons recidivism risk - the likelihood that they will reoffend. Risk scoring algorithms are widely used by judges to inform their sentencing and bail decisions in the criminal justice system in the United States. The original ProPublica analysis identified a number of fairness concerns around the use of COMPAS scores, including that ‘’black defendants were nearly twice as likely to be misclassified as higher risk compared to their white counterparts.’’ Please see the full article for further details. Use pandas to read in the data and set the id column to the index.

# replace the __ with correct values
df_pp = pd.__('__',
# find the csv file in the repo and use the raw button on github to get the compatible url
df_pp = pd.read_csv("",
df_pp = pd.read_csv("",

Look at the list of columns and the first few rows to get an idea of what the dataset looks like.

['name', 'first', 'last', 'compas_screening_date', 'sex', 'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number', 'r_charge_degree', 'r_days_from_arrest', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid', 'is_violent_recid', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'type_of_assessment', 'decile_score.1', 'score_text', 'screening_date', 'v_type_of_assessment', 'v_decile_score', 'v_score_text', 'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1', 'start', 'end', 'event', 'two_year_recid']
                  name   first         last compas_screening_date   sex  \
1     miguel hernandez  miguel    hernandez            2013-08-14  Male   
3          kevon dixon   kevon        dixon            2013-01-27  Male   
4             ed philo      ed        philo            2013-04-14  Male   
5          marcu brown   marcu        brown            2013-01-13  Male   
6   bouthy pierrelouis  bouthy  pierrelouis            2013-03-26  Male   

           dob  age          age_cat              race  juv_fel_count  ...  \
id                                                                     ...   
1   1947-04-18   69  Greater than 45             Other              0  ...   
3   1982-01-22   34          25 - 45  African-American              0  ...   
4   1991-05-14   24     Less than 25  African-American              0  ...   
5   1993-01-21   23     Less than 25  African-American              0  ...   
6   1973-01-22   43          25 - 45             Other              0  ...   

    v_decile_score  v_score_text  v_screening_date  in_custody  out_custody  \
1                1           Low        2013-08-14  2014-07-07   2014-07-14   
3                1           Low        2013-01-27  2013-01-26   2013-02-05   
4                3           Low        2013-04-14  2013-06-16   2013-06-16   
5                6        Medium        2013-01-13         NaN          NaN   
6                1           Low        2013-03-26         NaN          NaN   

   priors_count.1 start   end event two_year_recid  
1               0     0   327     0              0  
3               0     9   159     1              1  
4               4     0    63     0              1  
5               1     0  1174     0              0  
6               2     0  1102     0              0  

[5 rows x 52 columns]

Data Cleaning

For this analysis, we will restrict ourselves to only a few features, and clean the dataset according to the methods using in the original ProPublica analysis.

For this tutorial, we’ve prepared a cleaned copy of the data, that we can import directly.

df = pd.__('')
df = pd.read_csv('')

Data Exploration

Next we provide a few ways to look at the relationships between the attributes in the dataset. Here is an explanation of these values:

  • age: defendant’s age

  • c_charge_degree: degree charged (Misdemeanor of Felony)

  • race: defendant’s race

  • age_cat: defendant’s age quantized in “less than 25”, “25-45”, or “over 45”

  • score_text: COMPAS score: ‘low’(1 to 5), ‘medium’ (5 to 7), and ‘high’ (8 to 10).

  • sex: defendant’s gender

  • priors_count: number of prior charges

  • days_b_screening_arrest: number of days between charge date and arrest where defendant was screened for compas score

  • decile_score: COMPAS score from 1 to 10 (low risk to high risk)

  • is_recid: if the defendant recidivized

  • two_year_recid: if the defendant within two years

  • c_jail_in: date defendant was imprisoned

  • c_jail_out: date defendant was released from jail

  • length_of_stay: length of jail stay

In particular, as in the ProPublica analysis, we are interested in the implications for the treatment of different groups as defined by some protected attribute. In particular we will consider race as the protected attribute in our analysis. Next we look at the number of entries for each race.

  1. Use value_counts to look at how much data is available for each race

African-American    3175
Caucasian           2103
Name: race, dtype: int64
  1. compare that with what the original dataset had

African-American    3696
Caucasian           2454
Hispanic             637
Other                377
Asian                 32
Native American       18
Name: race, dtype: int64

COMPAS score distribution

Let’s look at the COMPAS score distribution between African-Americans and Caucasians (matches the one in the ProPublica article).

race_score_table = df.groupby([___]).size().reset_index().pivot(index='__',columns='___',values=0)

# print percentage of defendants in each score category
race_score_table = df.groupby(['race','decile_score']).size().reset_index().pivot(

# percentage of defendants in each score category
decile_score 1 2 3 4 5 6 7 8 9 10
African-American 11.496063 10.897638 9.385827 10.614173 10.173228 10.015748 10.803150 9.480315 9.984252 7.149606
Caucasian 28.768426 15.263909 11.317166 11.554922 9.510223 7.608179 5.373276 4.564907 3.661436 2.377556

Next, make a bar plot with that table (quickest way is to use pandas plot with figsize=[12,7] to make it bigger, plot type is indicated by the kind parameter)


As you can observe, there is a large discrepancy. Does this change when we condition on other variables?

  1. Look at how priors are distributed. Follow what you did above for score by race (or continue for help)

priors = df.groupby(['race','priors_count']).size().reset_index().pivot(index='priors_count',columns='race',values=0)
  1. Look at how scores are distributed for those with more than two priors

  2. (bonus) What about with less than two priors ?(you can copy or import again the above and modify it)

  3. (bonus) Look at first time (use priors_count) felons (c_charge_degree of F) under 25. How is this different?

df_2priors = df.loc[df['priors_count']>=2]
score_2priors = df_2priors.groupby(['race','decile_score']).size().reset_index().pivot(

What happens when we take actual 2-year recidivism values into account? Are the predictions fair?

First, we’re going to load a different version of the data, it’s quantized. Then look at the correlation between the quantized score, the decile score and the actual recidivism.

dfQ = pd.read_csv('')

Is the ground truth correlated to the high/low rating (score_text)?

# measure with high-low score
two_year_recid score_text
two_year_recid 1.000000 0.314698
score_text 0.314698 1.000000

Is the ground truth correlated to the decile_scorerating?

two_year_recid decile_score
two_year_recid 1.000000 0.368193
decile_score 0.368193 1.000000

The correlation is not that high. How can we evaluate whether the predictions made by the COMPAS scores are fair, especially considering that they do not predict recidivism rates well?

Fairness Metrics

The question of how to determine if an algorithm is fair has seen much debate recently (see this tutorial from the Conference on Fairness, Acountability, and Transparency titled 21 Fairness Definitions and Their Politics.

And in fact some of the definitions are contradictory, and have been shown to be mutually exclusive [2,3]

Here we will cover 3 notions of fairness and present ways to measure them:

  1. Disparate Impact 4 The 80% rule

  2. Calibration 6

  3. Equalized Odds 5

For the rest of our analysis we will use a binary outcome - COMPAS score <= 4 is LOW RISK, >4 is HIGH RISK.

Disparate Impact

Disparate impact is a legal concept used to describe situations when an entity such as an employer inadvertently discriminates gainst a certain protected group. This is distinct from disparate treatment where discrimination is intentional.

To demonstrate cases of disparate impact, the Equal Opportunity Commission (EEOC) proposed “rule of thumb” is known as the The 80% rule.

Feldman et al. 4 adapted a fairness metric from this principle. For our application, it states that the percent of defendants predicted to be high risk in each protected group (in this case whites and African-Americans) should be within 80% of each other.

Let’s evaluate this standard for the COMPAS data.

#  Let's measure the disparate impact according to the EEOC rule
means_score = dfQ.groupby(['score_text','race']).size().unstack().reset_index()
means_score = means_score/means_score.sum()
# split this cell for the above to print
# compute disparate impact
AA_with_high_score = means_score.loc[1,'African-American']
C_with_high_score = means_score.loc[1,'Caucasian']


This ratio is below .8, so there is disparate impact by this rule. (Taking the priveleged group and the undesirable outcome instead of the disadvantaged group and the favorable outcome).

What if we apply the same rule to the true two year rearrest instead of the quantized COMPAS score?

means_2yr = dfQ.groupby(['two_year_recid','race']).size().unstack()
means_2yr = means_2yr/means_2yr.sum()

# compute disparte impact
AA_with_high_score = means_2yr.loc[1,'African-American']
C_with_high_score = means_2yr.loc[1,'Caucasian']

There is a difference in re-arrest, but not as high as assigned by the COMPAS scores. This is still a disparate impact of the actual arrests (since this not necessarily accurate as a recidivism rate, but it is true rearrest).

Now let’s measure the difference in scores when we consider both the COMPAS output and true recidivism.


A discussion of using calibration to verify the fairness of a model can be found in Northpoint’s (now: Equivant) response to the ProPublica article 6.

The basic idea behind calibrating a classifier is that you want the confidence of the predictor to reflect the true outcomes. So, in a well-calibrated classifier, if 100 people are assigned 90% confidence of being in the positive class, then in reality, 90 of them should actually have had a positive label.

To use calibration as a fairness metric we compare the calibration of the classifier for each group. The smaller the difference, the more fair the calssifier.

In our problem this can be expressed as given \(Y\) indicating two year recidivism, \(S_Q\) indicating score (0=low, 1=high medium), and \(R\) indicating race, we measure

\[ \begin{align}\begin{aligned}\mathsf{cal} \triangleq \frac{\mathbb{P}\left(Y=1\mid S_Q=s,R=\mbox{African-American} \right)}{\mathbb{P}\left(Y=1 \mid S_Q=s,R=\mbox{Caucasian} \right)},$$ for different scores $s$. Considering our quantized scores, we look at the calibration for $s=1$.\\ #### Discuss 1. Do you think this is close enough? 1. Which metric do you think is better so far?\\```{code-cell} ipython3 --- lecture_tools: block: calibration type: solution --- # compute averages dfAverage = dfQ.groupby(['race','score_text'])['two_year_recid'].mean().unstack()\\num = dfAverage.loc['African-American',1] denom = dfAverage.loc['Caucasian',1] cal = num/denom calpercent = 100*(cal-1) print('Calibration: %f' % cal) print('Calibration in percentage: %f%%' % calpercent) ```\\+++ {"lecture_tools": {"block": "calibration", "type": "interpretation"}}\\The difference looks much smaller than before. The problem of the above calibration measure is that it depends on the threshold on which we quantized the scores $S_Q$.\\In order to mitigate this, one might use a variation of this measure called *predictive parity.* In this example, we define predictive parity as\\$$\mathsf{PP}(s) \triangleq \frac{\mathbb{P}\left(Y=1\mid S\geq s,R=\mbox{African-American} \right)}{\mathbb{P}\left(Y=1 \mid S\geq s,R=\mbox{Caucasian} \right)},\end{aligned}\end{align} \]

where \(S\) is the original score.

We plot \(\mathsf{PP}(s) \) for \(s\) from 1 to 10. Note how predictive parity depends significantly on the threshold.

# aux function for thresh score
def threshScore(x,s):
    if x>=s:
        return 1
        return 0

ppv_values = []
dfP = dfQ[['race','two_year_recid']].copy()
for s in range(1,11):
    dfP['threshScore'] = dfQ['decile_score'].apply(lambda x: threshScore(x,s))
    dfAverage = dfP.groupby(['race','threshScore'])['two_year_recid'].mean().unstack()
    num = dfAverage.loc['African-American',1]
    denom = dfAverage.loc['Caucasian',1]

plt.xlabel('Score Threshold')
plt.ylabel('Predictive Parity (percentage)')
plt.title('Predictive parity for different thresholds\n(warning: no error bars)')
Text(0.5, 1.0, 'Predictive parity for different thresholds\n(warning: no error bars)')

Equalized Odds

The last fairness metric we consider is based on the difference in error rates between groups. Hardt et al. 5 propose to look at the difference in the true positive and false positive rates for each group. This aligns with the analysis performed by Propublica. We can examine these values looking at is the ROC for each group. We normalize the score between 0 and 1. The ROC thresholds produced by scikitlearn are the same.

Discuss these results and copmare how these metrics show that there is (or is not) a disparity.

# normalize decile score
max_score = dfQ['decile_score'].max()
min_score = dfQ['decile_score'].min()
dfQ['norm_score'] = (dfQ['decile_score']-min_score)/(max_score-min_score)

#plot ROC curve for African-Americans
y = dfQ.loc[dfQ['race']=='African-American',['two_year_recid','norm_score']].values
fpr1,tpr1,thresh1 = roc_curve(y_true = y[:,0],y_score=y[:,1])

#plot ROC curve for Caucasian
y = dfQ.loc[dfQ['race']=='Caucasian',['two_year_recid','norm_score']].values
fpr2,tpr2,thresh2 = roc_curve(y_true = y[:,0],y_score=y[:,1])
l = np.linspace(0,1,10)

plt.xlabel('False Positive Rate')
plt.ylabel('True Postitive Rate')
<matplotlib.legend.Legend at 0x7f2a5887f090>

Extension: CORELS

COPMAS has also been criticized for being a generally opaque system. Some machine learning models are easier to understand than others, for example a rule list is easy to understand. The CORELS system learns a rule list from the ProPublica data and reports similar accuracy.

if ({Prior-Crimes>3}) then ({label=1})
else if ({Age=18-22}) then ({label=1})
else ({label=0})

Let’s investigate how the rule learned by CORELS compares.

  1. Write a function that takes one row of the data frame and computes the corels function

  2. Use df.apply to apply your function and add a column to the data frame with the corels score

  3. Evaluate the CORELS prediction with respect to accuracy, and fairness following the above

def corels_rule(row):
    if ___:
        return __
    elif ___:
        return __
        return False

df['corels'] = df.apply(____,axis=1)

#  Let's measure the disparate impact according to the EEOC rule
means_corel = df.groupby(['corels','race']).size().unstack().reset_index()
means_corel = means_corel/means_corel.sum()
def corels_rule(row):
    if row['priors_count'] > 3:
        return True
    elif row['age'] == 'Less than 25':
        return True
        return False

df['corels'] = df.apply(corels_rule,axis=1)

#  Let's measure the disparate impact according to the EEOC rule
means_corel = df.groupby(['corels','race']).size().unstack().reset_index()
means_corel = means_corel/means_corel.sum()
race corels African-American Caucasian
0 0.0 0.617638 0.788873
1 1.0 0.382362 0.211127