Replicating ProPublica’s COMPAS Audit in TD STEM Academy 2021


ProPublica started the COMPAS debate with the article Machine Bias. Along with the article, they released details of their methodology, their data, and their code. This provides a real dataset for research on how data is used in a criminal justice setting, without researchers having to file their own requests for information, so it has been reused many times.

First, we need to import some common libraries:

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from sklearn.metrics import roc_curve
import warnings

ProPublica COMPAS Data

The dataset consists of COMPAS scores assigned to defendants over two years (2013-2014) in Broward County, Florida. It was released by ProPublica in a GitHub repository. These scores are determined by a proprietary algorithm designed to evaluate a person’s recidivism risk: the likelihood that they will reoffend. Risk scoring algorithms are widely used by judges to inform their sentencing and bail decisions in the criminal justice system in the United States. The original ProPublica analysis identified a number of fairness concerns around the use of COMPAS scores, including that “black defendants were nearly twice as likely to be misclassified as higher risk compared to their white counterparts.” Please see the full article for further details. Use pandas to read in the data and set the id column as the index.

df_pp = pd.read_csv("",

Look at the list of columns and the first few rows to get an idea of what the dataset looks like.

['name', 'first', 'last', 'compas_screening_date', 'sex', 'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number', 'r_charge_degree', 'r_days_from_arrest', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid', 'is_violent_recid', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'type_of_assessment', 'decile_score.1', 'score_text', 'screening_date', 'v_type_of_assessment', 'v_decile_score', 'v_score_text', 'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1', 'start', 'end', 'event', 'two_year_recid']
name first last compas_screening_date sex dob age age_cat race juv_fel_count ... v_decile_score v_score_text v_screening_date in_custody out_custody priors_count.1 start end event two_year_recid
1 miguel hernandez miguel hernandez 2013-08-14 Male 1947-04-18 69 Greater than 45 Other 0 ... 1 Low 2013-08-14 2014-07-07 2014-07-14 0 0 327 0 0
3 kevon dixon kevon dixon 2013-01-27 Male 1982-01-22 34 25 - 45 African-American 0 ... 1 Low 2013-01-27 2013-01-26 2013-02-05 0 9 159 1 1
4 ed philo ed philo 2013-04-14 Male 1991-05-14 24 Less than 25 African-American 0 ... 3 Low 2013-04-14 2013-06-16 2013-06-16 4 0 63 0 1
5 marcu brown marcu brown 2013-01-13 Male 1993-01-21 23 Less than 25 African-American 0 ... 6 Medium 2013-01-13 NaN NaN 1 0 1174 0 0
6 bouthy pierrelouis bouthy pierrelouis 2013-03-26 Male 1973-01-22 43 25 - 45 Other 0 ... 1 Low 2013-03-26 NaN NaN 2 0 1102 0 0

5 rows × 52 columns
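The loading and inspection steps above can be sketched on a small stand-in. The three-row CSV below is hypothetical (it is not the ProPublica file, whose location is not reproduced here); it only illustrates how `index_col` turns the id column into the index and how to list the columns and preview the rows.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real file; only the layout matters.
toy_csv = io.StringIO(
    "id,name,race,decile_score,two_year_recid\n"
    "1,person a,Other,1,0\n"
    "3,person b,African-American,3,1\n"
    "4,person c,Caucasian,4,1\n"
)

# index_col makes the id column the row index instead of a regular column.
toy = pd.read_csv(toy_csv, index_col="id")

print(list(toy.columns))  # column names (id is no longer among them)
print(toy.head())         # first rows of the frame (at most five)
```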

Data Cleaning

For this analysis, we will restrict ourselves to only a few features, and clean the dataset according to the methods used in the original ProPublica analysis.

For this tutorial, we’ve prepared a cleaned copy of the data that we can import directly.

df = pd.read_csv('')
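For reference, the row filters that ProPublica describes in its methodology can be sketched as follows. This is a minimal illustration on a made-up five-row frame, not the code that produced the cleaned copy used here; the frame `raw` and its values are hypothetical, while the column names follow the dataset.

```python
import pandas as pd

# Made-up rows: only the first one passes every filter.
raw = pd.DataFrame({
    "days_b_screening_arrest": [0, 45, -5, 2, 10],
    "is_recid": [0, 1, -1, 1, 0],
    "c_charge_degree": ["F", "M", "F", "O", "M"],
    "score_text": ["Low", "High", "Medium", "Low", "N/A"],
    "c_jail_in": ["2013-01-01", "2013-02-01", "2013-03-01", "2013-04-01", "2013-05-01"],
    "c_jail_out": ["2013-01-03", "2013-02-02", "2013-03-05", "2013-04-02", "2013-05-02"],
})

keep = (
    raw["days_b_screening_arrest"].between(-30, 30)  # screened within 30 days of arrest
    & (raw["is_recid"] != -1)                        # recidivism flag is known
    & (raw["c_charge_degree"] != "O")                # drop ordinary traffic offenses
    & (raw["score_text"] != "N/A")                   # a COMPAS score was assigned
)
clean = raw.loc[keep].copy()

# length_of_stay in whole days, from the jail-in and jail-out dates
clean["length_of_stay"] = (
    pd.to_datetime(clean["c_jail_out"]) - pd.to_datetime(clean["c_jail_in"])
).dt.days
print(clean)
```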

Data Exploration

Next we provide a few ways to look at the relationships between the attributes in the dataset. Here is an explanation of these values:

  • age: defendant’s age

  • c_charge_degree: degree charged (Misdemeanor or Felony)

  • race: defendant’s race

  • age_cat: defendant’s age binned into “Less than 25”, “25 - 45”, or “Greater than 45”

  • score_text: COMPAS score category: ‘low’ (1 to 4), ‘medium’ (5 to 7), and ‘high’ (8 to 10).

  • sex: defendant’s gender

  • priors_count: number of prior charges

  • days_b_screening_arrest: number of days between the charge date and the arrest for which the defendant was screened for a COMPAS score

  • decile_score: COMPAS score from 1 to 10 (low risk to high risk)

  • is_recid: if the defendant recidivized

  • two_year_recid: if the defendant recidivized within two years

  • c_jail_in: date defendant was imprisoned

  • c_jail_out: date defendant was released from jail

  • length_of_stay: length of jail stay

As in the ProPublica analysis, we are interested in the implications for the treatment of different groups as defined by some protected attribute. In particular, we will consider race as the protected attribute in our analysis. Next we look at the number of entries for each race.

  1. Use value_counts to look at how much data is available for each race and compare the original and clean versions

African-American    3175
Caucasian           2103
Name: race, dtype: int64
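A side-by-side comparison of the counts can be sketched like this. The two series below are hypothetical stand-ins for the race columns of the original and cleaned frames; the only assumption taken from the output above is that the cleaned data keeps the African-American and Caucasian groups.

```python
import pandas as pd

# Hypothetical race columns standing in for the original and cleaned data.
original = pd.Series(
    ["African-American", "Caucasian", "Hispanic", "Other",
     "African-American", "Caucasian", "African-American"],
    name="race",
)
cleaned = original[original.isin(["African-American", "Caucasian"])]

# Align the two value_counts results on the race label; groups that are
# absent from the cleaned data show up as 0.
counts = pd.DataFrame({
    "original": original.value_counts(),
    "clean": cleaned.value_counts(),
}).fillna(0).astype(int)
print(counts)
```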

COMPAS score distribution

Let’s look at the COMPAS score distribution between African-Americans and Caucasians (matches the one in the ProPublica article).

race_score_table = df.groupby(['race','decile_score']).size().reset_index().pivot(

# percentage of defendants in each score category
decile_score               1          2          3          4          5          6          7          8          9         10
African-American  11.496063  10.897638   9.385827  10.614173  10.173228  10.015748  10.803150   9.480315   9.984252   7.149606
Caucasian         28.768426  15.263909  11.317166  11.554922   9.510223   7.608179   5.373276   4.564907   3.661436   2.377556

Next, make a bar plot with that table. The quickest way is to use the pandas plot method with figsize=[12,7] to make it bigger; the plot type is set by the kind parameter.
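One way to build such a percentage table is sketched below on a hypothetical eight-row frame. It uses `unstack` rather than the `pivot` call above, so it is an alternative route to the same shape and not necessarily the tutorial's own code.

```python
import pandas as pd

# Hypothetical scores; the real analysis would use the cleaned frame df.
toy = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "decile_score": [1, 5, 9, 9, 1, 1, 2, 7],
})

# Count defendants per (race, score), with one row per race.
counts = toy.groupby(["race", "decile_score"]).size().unstack(fill_value=0)

# Divide each row by its total so every race sums to 100 percent.
percent = counts.div(counts.sum(axis=1), axis=0) * 100
print(percent.round(1))
```

Calling `percent.T.plot(kind='bar', figsize=[12,7])` on the result would then draw one group of bars per score, with one bar per race.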


As you can observe, there is a large discrepancy. Does this change when we condition on other variables?

  1. Look at how priors are distributed. Follow what you did above for score by race (or continue for help)

priors = df.groupby(['race','priors_count']).size().reset_index().pivot(index='priors_count',columns='race',values=0)

  1. Look at how scores are distributed for those with two or more priors

  2. (bonus) What about those with fewer than two priors? (You can copy the above and modify it.)

  3. (bonus) Look at first-time (use priors_count) felons (c_charge_degree of F) under 25. How is this different?

df_2priors = df.loc[df['priors_count']>=2]
score_2priors = df_2priors.groupby(['race','decile_score']).size().reset_index().pivot(
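The subgroups in the bonus questions can be selected with boolean masks, as in this sketch. The frame is hypothetical, and reading "first-time" as `priors_count == 0` and "under 25" as the `'Less than 25'` age category are assumptions about what the exercise intends.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
toy = pd.DataFrame({
    "race": ["African-American", "Caucasian", "African-American",
             "Caucasian", "African-American"],
    "priors_count": [0, 0, 3, 1, 0],
    "c_charge_degree": ["F", "F", "M", "F", "M"],
    "age_cat": ["Less than 25", "Less than 25", "25 - 45",
                "Greater than 45", "Less than 25"],
    "decile_score": [6, 3, 8, 2, 4],
})

# Fewer than two priors.
few_priors = toy.loc[toy["priors_count"] < 2]

# First-time felons under 25: combine conditions with & and parentheses.
first_time_young_felons = toy.loc[
    (toy["priors_count"] == 0)
    & (toy["c_charge_degree"] == "F")
    & (toy["age_cat"] == "Less than 25")
]

# Mean score per race within the subgroup, as a quick summary.
print(first_time_young_felons.groupby("race")["decile_score"].mean())
```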

What happens when we take actual 2-year recidivism values into account? Are the predictions fair?

First, we’re going to load a different version of the data in which the score is quantized. Then we look at the correlation between the quantized score, the decile score, and the actual recidivism.

dfQ = pd.read_csv('')

Is the ground truth correlated to the high/low rating (score_text)?

# measure with high-low score
                two_year_recid  score_text
two_year_recid        1.000000    0.314698
score_text            0.314698    1.000000

Is the ground truth correlated to the decile_score rating?

                two_year_recid  decile_score
two_year_recid        1.000000      0.368193
decile_score          0.368193      1.000000
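Correlation tables of this shape can be produced with `DataFrame.corr`, as in the sketch below. The six rows are hypothetical, and the quantized score is assumed to be 1 when `decile_score` is above 4, which is the binary convention this tutorial adopts for the fairness metrics.

```python
import pandas as pd

# Hypothetical outcomes and scores.
toy = pd.DataFrame({
    "two_year_recid": [0, 0, 1, 1, 0, 1],
    "decile_score": [1, 3, 8, 6, 5, 4],
})

# Assumed quantization: 0 = low (score <= 4), 1 = medium or high (score > 4).
toy["score_text"] = (toy["decile_score"] > 4).astype(int)

# Pearson correlation of the outcome with the binary and the decile score.
corr_binary = toy[["two_year_recid", "score_text"]].corr()
corr_decile = toy[["two_year_recid", "decile_score"]].corr()
print(corr_binary)
print(corr_decile)
```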

The correlation is not that high. How can we evaluate whether the predictions made by the COMPAS scores are fair, especially considering that they do not predict recidivism rates well?

Fairness Metrics

The question of how to determine if an algorithm is fair has seen much debate recently (see this tutorial from the Conference on Fairness, Accountability, and Transparency titled 21 Fairness Definitions and Their Politics).

In fact, some of the definitions are contradictory and have been shown to be mutually exclusive [2,3].

Here we will cover 3 notions of fairness and present ways to measure them:

  1. Disparate Impact [4] (the 80% rule)

  2. Calibration [6]

  3. Equalized Odds [5]

For the rest of our analysis we will use a binary outcome - COMPAS score <= 4 is LOW RISK, >4 is HIGH RISK.

Disparate Impact

Disparate impact is a legal concept used to describe situations when an entity such as an employer inadvertently discriminates against a certain protected group. This is distinct from disparate treatment, where discrimination is intentional.

To demonstrate cases of disparate impact, the Equal Employment Opportunity Commission (EEOC) proposed a “rule of thumb” known as the 80% rule.

Feldman et al. [4] adapted a fairness metric from this principle. For our application, it states that the percentages of defendants predicted to be high risk in each protected group (in this case whites and African-Americans) should be within 80% of each other.

Let’s evaluate this standard for the COMPAS data.

#  Let's measure the disparate impact according to the EEOC rule
means_score = dfQ.groupby(['score_text','race']).size().unstack().reset_index()
means_score = means_score/means_score.sum()
# split this cell for the above to print
# compute disparate impact
AA_with_high_score = means_score.loc[1,'African-American']
C_with_high_score = means_score.loc[1,'Caucasian']


This ratio is below 0.8, so there is disparate impact by this rule. (Here we take the privileged group and the undesirable outcome, instead of the disadvantaged group and the favorable outcome.)
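The ratio itself can be computed as in this sketch. The helper `disparate_impact` and the ten-row frame are hypothetical; the ratio follows the convention described above, dividing the privileged group's rate of the undesirable outcome by the disadvantaged group's rate.

```python
import pandas as pd

def disparate_impact(frame, outcome, group, privileged, unprivileged):
    """Ratio of the outcome rate in the privileged group to the unprivileged group."""
    rates = frame.groupby(group)[outcome].mean()  # share with outcome == 1 per group
    return rates[privileged] / rates[unprivileged]

# Hypothetical data: 3 of 5 vs. 1 of 5 defendants labeled high risk.
toy = pd.DataFrame({
    "race": ["African-American"] * 5 + ["Caucasian"] * 5,
    "high_risk": [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
})

ratio = disparate_impact(toy, "high_risk", "race",
                         privileged="Caucasian", unprivileged="African-American")
print("ratio: %.3f, passes 80%% rule: %s" % (ratio, ratio >= 0.8))
```

Because the function takes the outcome column as a parameter, the same call with `two_year_recid` as the outcome would give the ratio for observed re-arrest.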

What if we apply the same rule to the true two year rearrest instead of the quantized COMPAS score?

means_2yr = dfQ.groupby(['two_year_recid','race']).size().unstack()
means_2yr = means_2yr/means_2yr.sum()

# compute disparate impact
AA_with_high_score = means_2yr.loc[1,'African-American']
C_with_high_score = means_2yr.loc[1,'Caucasian']

There is a difference in re-arrest rates, but it is smaller than the difference in assigned COMPAS scores. This is still a disparate impact in the actual arrests (re-arrest is not necessarily an accurate measure of recidivism, but it is the recorded outcome).

Now let’s measure the difference in scores when we consider both the COMPAS output and true recidivism.


Calibration

A discussion of using calibration to verify the fairness of a model can be found in Northpointe’s (now Equivant) response to the ProPublica article [6].

The basic idea behind calibrating a classifier is that you want the confidence of the predictor to reflect the true outcomes. So, in a well-calibrated classifier, if 100 people are assigned 90% confidence of being in the positive class, then in reality, 90 of them should actually have had a positive label.

To use calibration as a fairness metric, we compare the calibration of the classifier for each group. The smaller the difference, the fairer the classifier.

In our problem this can be expressed as follows: given \(Y\) indicating two-year recidivism, \(S_Q\) indicating score (0=low, 1=medium or high), and \(R\) indicating race, we measure

\[ \mathsf{cal} \triangleq \frac{\mathbb{P}\left(Y=1\mid S_Q=s,R=\mbox{African-American} \right)}{\mathbb{P}\left(Y=1 \mid S_Q=s,R=\mbox{Caucasian} \right)}, \]

for different scores \(s\). Considering our quantized scores, we look at the calibration for \(s=1\).

Discuss

  1. Do you think this is close enough?

  2. Which metric do you think is better so far?

# compute averages
dfAverage = dfQ.groupby(['race','score_text'])['two_year_recid'].mean().unstack()

num = dfAverage.loc['African-American',1]
denom = dfAverage.loc['Caucasian',1]
cal = num/denom
calpercent = 100*(cal-1)
print('Calibration: %f' % cal)
print('Calibration in percentage: %f%%' % calpercent)

The difference looks much smaller than before. The problem of the above calibration measure is that it depends on the threshold on which we quantized the scores \(S_Q\).

In order to mitigate this, one might use a variation of this measure called predictive parity. In this example, we define predictive parity as

\[ \mathsf{PP}(s) \triangleq \frac{\mathbb{P}\left(Y=1\mid S\geq s,R=\mbox{African-American} \right)}{\mathbb{P}\left(Y=1 \mid S\geq s,R=\mbox{Caucasian} \right)}, \]

where \(S\) is the original score.

We plot \(\mathsf{PP}(s) \) for \(s\) from 1 to 10. Note how predictive parity depends significantly on the threshold.

# aux function for thresh score
def threshScore(x,s):
    # 1 if the score x is at or above the threshold s, otherwise 0
    if x>=s:
        return 1
    return 0

ppv_values = []
dfP = dfQ[['race','two_year_recid']].copy()
for s in range(1,11):
    dfP['threshScore'] = dfQ['decile_score'].apply(lambda x: threshScore(x,s))
    dfAverage = dfP.groupby(['race','threshScore'])['two_year_recid'].mean().unstack()
    num = dfAverage.loc['African-American',1]
    denom = dfAverage.loc['Caucasian',1]
    # assumed: store PP(s) as a percentage difference, like calpercent above
    ppv_values.append(100*(num/denom-1))

# plot predictive parity against the score threshold
plt.plot(range(1,11),ppv_values)
plt.xlabel('Score Threshold')
plt.ylabel('Predictive Parity (percentage)')
plt.title('Predictive parity for different thresholds\n(warning: no error bars)')

Equalized Odds

The last fairness metric we consider is based on the difference in error rates between groups. Hardt et al. [5] propose to look at the difference in the true positive and false positive rates for each group. This aligns with the analysis performed by ProPublica. We can examine these values by looking at the ROC curve for each group. We normalize the score to lie between 0 and 1. The ROC thresholds produced by scikit-learn are the same.

Discuss these results and compare how these metrics show that there is (or is not) a disparity.

# normalize decile score
max_score = dfQ['decile_score'].max()
min_score = dfQ['decile_score'].min()
dfQ['norm_score'] = (dfQ['decile_score']-min_score)/(max_score-min_score)

# ROC curve for African-Americans
y = dfQ.loc[dfQ['race']=='African-American',['two_year_recid','norm_score']].values
fpr1,tpr1,thresh1 = roc_curve(y_true = y[:,0],y_score=y[:,1])
plt.plot(fpr1,tpr1,label='African-American')

# ROC curve for Caucasians
y = dfQ.loc[dfQ['race']=='Caucasian',['two_year_recid','norm_score']].values
fpr2,tpr2,thresh2 = roc_curve(y_true = y[:,0],y_score=y[:,1])
plt.plot(fpr2,tpr2,label='Caucasian')

# diagonal reference line (chance level)
l = np.linspace(0,1,10)
plt.plot(l,l,'k--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
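The ROC curves summarize every threshold at once. For a single operating point, the true and false positive rates per group can be compared directly, as in this sketch. The helper `error_rates` and the eight-row frame are hypothetical, and the high-risk label assumes the tutorial's binary cut-off of a score above 4.

```python
import pandas as pd

def error_rates(frame, y_true, y_pred):
    """True positive rate and false positive rate for binary labels and predictions."""
    tp = ((frame[y_pred] == 1) & (frame[y_true] == 1)).sum()
    fn = ((frame[y_pred] == 0) & (frame[y_true] == 1)).sum()
    fp = ((frame[y_pred] == 1) & (frame[y_true] == 0)).sum()
    tn = ((frame[y_pred] == 0) & (frame[y_true] == 0)).sum()
    return pd.Series({"tpr": tp / (tp + fn), "fpr": fp / (fp + tn)})

# Hypothetical outcomes and scores for two groups of four defendants.
toy = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "two_year_recid": [1, 1, 0, 0, 1, 1, 0, 0],
    "decile_score": [8, 6, 7, 2, 9, 3, 2, 1],
})
toy["high_risk"] = (toy["decile_score"] > 4).astype(int)

# One row of rates per group; equalized odds asks these rows to be similar.
rates = pd.DataFrame({
    group: error_rates(sub, "two_year_recid", "high_risk")
    for group, sub in toy.groupby("race")
}).T
print(rates)
```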

Extension: CORELS

COMPAS has also been criticized for being a generally opaque system. Some machine learning models are easier to understand than others; for example, a rule list is easy to understand. The CORELS system learns a rule list from the ProPublica data and reports similar accuracy.

if ({Prior-Crimes>3}) then ({label=1})
else if ({Age=18-22}) then ({label=1})
else ({label=0})

Let’s investigate how the rule learned by CORELS compares.

  1. Write a function that takes one row of the data frame and computes the corels function

  2. Use df.apply to apply your function and add a column to the data frame with the corels score

  3. Evaluate the CORELS prediction with respect to accuracy, and fairness following the above

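def corels_rule(row):
    # rule list from CORELS; the 'Less than 25' age category is used
    # as an approximation of the Age=18-22 rule
    if row['priors_count'] > 3:
        return True
    elif row['age_cat'] == 'Less than 25':
        return True
    return False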

df['corels'] = df.apply(corels_rule,axis=1)

#  Let's measure the disparate impact according to the EEOC rule
means_corel = df.groupby(['corels','race']).size().unstack().reset_index()
means_corel = means_corel/means_corel.sum()
race  corels  African-American  Caucasian
0        0.0          0.617638   0.788873
1        1.0          0.382362   0.211127
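Accuracy and the disparate impact ratio for the rule list can be evaluated together, as in this sketch. The six rows are hypothetical, and the young-defendant rule is approximated with the `'Less than 25'` age category, which is an assumption and not the exact Age=18-22 condition.

```python
import pandas as pd

# Hypothetical defendants using the dataset's column names.
toy = pd.DataFrame({
    "race": ["African-American"] * 3 + ["Caucasian"] * 3,
    "priors_count": [5, 0, 1, 4, 0, 2],
    "age_cat": ["25 - 45", "Less than 25", "25 - 45",
                "Greater than 45", "25 - 45", "25 - 45"],
    "two_year_recid": [1, 1, 0, 0, 0, 1],
})

# Vectorized form of the rule list: many priors OR young defendant.
toy["corels"] = (
    (toy["priors_count"] > 3) | (toy["age_cat"] == "Less than 25")
).astype(int)

# Accuracy: share of defendants whose prediction matches the observed outcome.
accuracy = (toy["corels"] == toy["two_year_recid"]).mean()

# Disparate impact: rate of positive predictions, privileged over disadvantaged.
rates = toy.groupby("race")["corels"].mean()
ratio = rates["Caucasian"] / rates["African-American"]
print("accuracy: %.3f, disparate impact ratio: %.3f" % (accuracy, ratio))
```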