In discussions of algorithmic bias, the COMPAS scandal has too often been quoted out of context. This post gives the facts, and the interpretation, as quickly as possible. See this for details.

#### The fight

The COMPAS system is a statistical decision algorithm trained on past statistical data on American convicts. It takes as inputs features about the convict and outputs a "risk score" that indicates how likely the convict would reoffend if released.

In 2016, ProPublica claimed that COMPAS is clearly unfair to blacks in one way. Northpointe, the company behind COMPAS, replied that it is approximately fair in another way. ProPublica then rebutted with many statistical details that I didn't read.

The basic paradox at the heart of the contention is very simple, and it is not simply "machines are biased because they learn from history and history is biased". It's just that there are many kinds of fairness, each of which may sound reasonable, but which are not compatible with each other in realistic circumstances. Northpointe chose one and ProPublica chose another.

#### The math

The actual COMPAS gives a risk score from 1 to 10, but a simple yes/no decision is enough to show the paradox. Consider the toy example where we have a decider (COMPAS, a jury, or a judge) judging whether each convict in a group would reoffend or not. How well the decider is doing can be measured in at least three ways:

- False negative rate = (false negative)/(actual positive)
- False positive rate = (false positive)/(actual negative)
- Calibration = (true positive)/(test positive)

A good decider should have false negative rate close to 0, false positive rate close to 0, and calibration close to 1.
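To make the three definitions concrete, here is a minimal sketch that computes them from the four confusion-matrix counts. The function name `rates` and the counts are made up for illustration; they are not COMPAS data.

```python
def rates(tp, fp, fn, tn):
    """Return (false negative rate, false positive rate, calibration)
    from the four confusion-matrix counts, using the definitions above."""
    actual_positive = tp + fn   # convicts who would reoffend
    actual_negative = fp + tn   # convicts who would not reoffend
    test_positive = tp + fp     # convicts the decider flags as risky
    fnr = fn / actual_positive
    fpr = fp / actual_negative
    calibration = tp / test_positive
    return fnr, fpr, calibration


# Hypothetical counts: 100 convicts, 40 of whom would reoffend.
fnr, fpr, calibration = rates(tp=30, fp=12, fn=10, tn=48)
print(f"FNR={fnr:.3f}  FPR={fpr:.3f}  calibration={calibration:.3f}")
```

With these made-up counts the decider misses a quarter of the reoffenders, wrongly flags a fifth of the non-reoffenders, and is right about roughly 71% of the people it flags.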

Visually, we can draw a "square" with four blocks:

- false negative rate = the "height" of the false negative block,
- false positive rate = the "height" of the false positive block,
- calibration = (true positive block)/(total area of the yellow blocks)

Now consider black convicts and white convicts, so that we have two squares. Since the two groups have different reoffense rates for some reason, the central vertical line sits at a different position in each square.

Suppose the decider tries to be fair by making sure that the false negative rate and the false positive rate are the same in both squares. Then it is forced to make the calibration for Whites lower than the calibration for Blacks.

Now suppose the decider tries to increase the calibration for Whites. Then it must somehow decrease the false negative rate or the false positive rate for Whites, which breaks the equality with Blacks.
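The squares argument can be checked numerically. The sketch below assumes two groups with made-up base rates (0.3 and 0.5) and a decider with the same false negative and false positive rates in both. The helper `calibration` is hypothetical and works with fractions of the population rather than counts.

```python
def calibration(base_rate, fnr, fpr):
    """Calibration = (true positive) / (test positive), with each block
    expressed as a fraction of the whole square."""
    true_positive = base_rate * (1 - fnr)
    false_positive = (1 - base_rate) * fpr
    return true_positive / (true_positive + false_positive)


# Same error rates in both groups (illustrative numbers).
FNR, FPR = 0.25, 0.2

cal_low = calibration(0.3, FNR, FPR)    # group with the lower base rate
cal_high = calibration(0.5, FNR, FPR)   # group with the higher base rate
print(f"lower base rate:  calibration = {cal_low:.3f}")
print(f"higher base rate: calibration = {cal_high:.3f}")

# A perfect decider has calibration 1 whatever the base rate.
print(calibration(0.3, 0.0, 0.0), calibration(0.5, 0.0, 0.0))
```

With equal error rates, the group with the lower base rate ends up with the lower calibration, which is the forced gap described above.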

In other words, when the base rates are different, it's impossible to make all three of these measures equal across the groups:

- false negative rate
- false positive rate
- calibration

Oh, forgot to mention, even when base rates are different, there's a way to have equal fairness measures in all three of those... But that requires the decider to be *perfect*: Its false positive rate and false negative rate must both be 0, and its calibration must be 1. This is unrealistic.

In the jargon of fairness measurement, "equal false negative rate and false positive rate" is "parity fairness"; "equal calibration" is just "calibration fairness". Parity fairness and calibration fairness can be straightforwardly generalized for COMPAS, which uses a 1-10 scoring scale, or indeed any numerical risk score.

It's some straightforward algebra to prove that in this general case, parity fairness and calibration fairness are incompatible when the base rates are different, and the decider is not perfect.
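For the binary toy case only (not the general score case), the algebra fits in one line. The symbols are my own shorthand: $p$ is a group's base rate, and $C$ is its calibration as defined in this post. As fractions of the square, the true positive block is $p\,(1-\mathrm{FNR})$ and the false positive block is $(1-p)\,\mathrm{FPR}$, so for $0 < p < 1$ and $C > 0$:

$$
C = \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}}
\quad\Longleftrightarrow\quad
\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-C}{C}\cdot(1-\mathrm{FNR})
$$

If two groups share the same $\mathrm{FNR}$ and the same $C$ but have different $p$, the factor $p/(1-p)$ differs between them, so their $\mathrm{FPR}$ must differ too. The only escape is $(1-C)(1-\mathrm{FNR}) = 0$: a decider that makes no false positives at all ($C = 1$), of which the perfect decider is the obvious example, or one that catches no reoffenders at all ($\mathrm{FNR} = 1$).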

#### The fight, after-math

Northpointe showed that COMPAS is approximately fair in calibration for Whites and Blacks. ProPublica showed that COMPAS is unfair in parity.

The lesson is that there are incompatible fairnesses. To figure out which to apply -- that is a different question.

I like the no-nonsense section titles!

I also like the attempt to graphically teach the conflict between the different fairness desiderata using squares, but I think I would need a few more intermediate diagrams (or probably, to work them out myself) to really "get it." I think the standard citation here is "Inherent Trade-Offs in the Fair Determination of Risk Scores", but that presentation has a lot more equations and fewer squares.

Yes, (Kleinberg et al, 2016)... Do not read it. Really, don't. The derivation is extremely clumsy (and my professor said so too).

The proof has been considerably simplified in subsequent work. Looking through the papers that cite it should turn up a published paper with the simplified proof...

Actually, Kleinberg et al. 2016 isn't all that bad. They have a small paragraph at the beginning of section 2 which they call an "informal overview" of the proof. But it's actually almost a decent proof in and of itself. You may accept it as such, or you may write it down a bit more formally, and you end up with a short, sweet proof. The reason they can't use a graphical approach like the one in this blog entry is that the above diagram with the squares only applies to the special case of scores that output either 0 or 1, but nothing in between. That is an important special case, but a special case nevertheless. Kleinberg et al. deal with the more common and slightly more general case of scores which can take any real value from 0 to 1. The COMPAS score, which is the topic of the ProPublica report cited above, can also take values other than just 0 and 1.

By the way, the introductory section of the Kleinberg et al. paper is also definitely worth reading. It gives an overview of the relevance of the problem to other areas of application. So only their attempt at a formal proof is kind of a waste of time to read.

I like the idea of clearly showing the core of the problem using a graphical approach, namely how the different base rates keep us from having both kinds of fairness.

There is one glitch, I'm afraid: It seems you got the notion of calibration wrong. In your way of using the word, an ideal calibration would be a perfect score, i.e. a score that outputs 1 for all the true positives and 0 for all the true negatives. While perfect scores play a certain role in Kleinberg et al's paper as an unrealistic corner case of their theorem, the standard notion of calibration is a different one: It demands that when you look at a score bracket (the set of all people having approximately the same score), the actual fraction of positive instances in this group should (approximately) coincide with the score value in this bracket. To avoid discrimination, one also checks that this is true for white and for black defendants separately.
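As a rough sketch of this bracket-based notion of calibration, the snippet below groups scores in [0, 1] into brackets and compares the mean score in each bracket with the observed fraction of positives. The function name `calibration_by_bracket`, the choice of ten equal-width brackets, and the data are all assumptions for illustration. To check for discrimination, one would run it separately on each group.

```python
from collections import defaultdict


def calibration_by_bracket(scores, outcomes, n_brackets=10):
    """Map each occupied bracket to (mean score, observed positive fraction).

    scores   -- risk scores in [0, 1]
    outcomes -- 1 if the person actually reoffended, else 0
    """
    buckets = defaultdict(list)
    for score, outcome in zip(scores, outcomes):
        # Equal-width brackets; a score of exactly 1.0 goes in the top one.
        bracket = min(int(score * n_brackets), n_brackets - 1)
        buckets[bracket].append((score, outcome))

    result = {}
    for bracket, pairs in sorted(buckets.items()):
        mean_score = sum(s for s, _ in pairs) / len(pairs)
        positive_fraction = sum(o for _, o in pairs) / len(pairs)
        result[bracket] = (mean_score, positive_fraction)
    return result


# Hypothetical, well-calibrated data: 1 in 5 of the people scored 0.2
# reoffend, and 4 in 5 of the people scored 0.8 do.
scores = [0.2, 0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
table = calibration_by_bracket(scores, outcomes)
for bracket, (mean_score, fraction) in table.items():
    print(bracket, round(mean_score, 3), fraction)
```

A score is calibrated in this sense when the two numbers in each bracket roughly agree.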

Fortunately, your approach still works with this definition. In your drawing, it translates into the demand that, in each of the two squares, the yellow area must be as large as the left column (the actual positives). Assume that this is the case in the upper drawing. When we go from the upper to the lower drawing, the boundary between the left and right column moves to the right, as the base rate is higher among blacks. This is nicely indicated with the red arrows in the lower drawing. So the area of the left column increases. But of this newly acquired territory of the left column, only a part is also a new part of the yellow area. Another part was yellow and stays yellow, and a third part is now in the left column, but not part of the yellow area. Hence, in the lower drawing, the left column is larger than the yellow area.

>> False negative rate = (false negative)/(actual positive)

>> False positive rate = (false positive)/(actual negative)

Correct me if I’m mistaken, but isn’t it:

False negative rate = (false negative)/(false negative + actual positive)

False positive rate = (false positive)/(false positive + actual negative)

I'll side with ProPublica, because my understanding of fairness (equal treatment for everyone) seems to be closer to parity than calibration. For example, a test that always returns positive or always flips a coin is parity-fair but not calibration-fair.
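The claim about the always-positive test and the coin flipper can be checked with expected values. This is a sketch using the post's toy definitions and made-up base rates of 0.3 and 0.5. The helper `expected_rates` is hypothetical. It takes the probability of being flagged, which for these two deciders is the same whether or not the person would reoffend.

```python
def expected_rates(base_rate, p_flag_if_positive, p_flag_if_negative):
    """Expected (FNR, FPR, calibration) for a decider that flags actual
    positives and actual negatives with the given probabilities."""
    tp = base_rate * p_flag_if_positive
    fn = base_rate * (1 - p_flag_if_positive)
    fp = (1 - base_rate) * p_flag_if_negative
    fnr = fn / base_rate
    fpr = fp / (1 - base_rate)
    calibration = tp / (tp + fp)
    return fnr, fpr, calibration


for name, p_flag in [("always positive", 1.0), ("coin flip", 0.5)]:
    for base_rate in (0.3, 0.5):
        fnr, fpr, cal = expected_rates(base_rate, p_flag, p_flag)
        print(f"{name:15s} base={base_rate}  FNR={fnr:.2f}  "
              f"FPR={fpr:.2f}  calibration={cal:.2f}")
```

In both cases the error rates are identical across the two groups, while the calibration simply equals each group's base rate, so it differs whenever the base rates do.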

Would you have the same position if the algorithm kept blacks in prison at much higher rates than their risk of recidivism warranted? (Instead of vice versa)

How do you feel about insurance? Insurance is, understandably, calibration-fair. If men have higher rates of car accidents than women, they pay a higher rate. They are not treated equally to women. You would prefer to eliminate discrimination on such variables? Wouldn't you have to eliminate all variables in order to treat everyone equally?

I don't think everyone should be parity-fair to everyone else - that's unfeasible. But I do think the government should be parity-fair. For example, a healthcare safety net shouldn't rely on free market insurance where the sick pay more. It's better to have a system like in Switzerland where everyone pays the same.

90% of the work ought to go into figuring out what fairness measure you want and why. Not so easy. Also not really a "math problem." Most ML papers on fairness just solve math problems.

Seconding Daniel.

This post does what it says in the title. With diagrams! It's not about AI safety or anything, but it's a high-quality explainer on a much-discussed topic, that I'd be happy to link people to.

Didn't you just show that "machines are biased because they learn from history and history is biased" is indeed the case? The base rates differ because of historical circumstances.

I'm following common speech where "biased" means "statistically immoral, because it violates some fairness requirement".

I showed that with base rate difference, it's impossible to satisfy three fairness requirements. The decider (machine or not) can completely ignore history. It could be a coin-flipper. As long as the decider is imperfect, it would still be unfair in one of the fairness requirements.

And if the base rates are not due to historical circumstances, this impossibility still stands.

I'm not sure what "statistically immoral" means nor have I ever heard the term, which makes me doubt it's common speech (googling it does not bring up any uses of the phrase).

I think we're using the term "historical circumstances" differently; I simply mean what's happened in the past. Isn't the base rate purely a function of the records of white/black convictions? If so, then the fact that the rates are not the same is the reason that we run into this fairness problem. I agree that this problem can apply in other settings, but in the case where the base rate is a function of history, is it not accurate to say that the cause of the conundrum is historical circumstances? An alternative history with equal, or essentially equal, rates of convictions would not suffer from this problem, right?

I think what people mean when they say things like "machines are biased because they learn from history and history is biased" is precisely this scenario: historically, conviction rates are not equal between racial groups and so any algorithm that learns to predict convictions based on historical data will inevitably suffer from the same inequality (or suffer from some other issue by trying to fix this one, as your analysis has shown).

No. Any decider will be unfair in some way, whether or not it knows anything about history. The decider can be a coin flipper and it would still be biased. One can say that the unfairness is baked into the reality of base-rate difference.

The only way to fix this is not to fix the decider, but to somehow make the base-rate difference disappear, or to relax the definition of fairness so that it is less stringent and becomes satisfiable.

And in common language and common discussion of algorithmic bias, "bias" is decidedly NOT merely a statistical definition. It always contains a moral judgment: violation of a fairness requirement. To say that a decider is biased is to say that the statistical pattern of its decision violates a fairness requirement.

The key message is that, by the common language definition, "bias" is unavoidable. No amount of trying to fix the decider will make it fair. Blinding it to the history will do nothing. The unfairness is in the base rate, and in the definition of fairness.

The base rates in the diagram are not historical but "potential" rates. They show the proportion of current inmates up for parole who would be re-arrested if paroled. In practice this is indeed estimated by looking at historical rates but as long as the true base rates are different in reality, no algorithm can be fair in the two senses described above.

Afaik, in ML, the term bias is used to describe any move away from the uniform / mean case. But in common speech, such a move would only be called a bias if it's inaccurate. So if the algorithm learns a true pattern in the data (X is more likely to be classified as 1 than Y is) that wouldn't be called a bias. Unless I misunderstand your point.

1. No-one has access to the actual "re-offend" rates: all we have is "re-arrest," "re-convict," or at best "observed and reported re-offence" rates.

2. A priori, we do not expect the amount of melanin in a person's skin, or the word they write down on a form next to the prompt "Race", to be correlated with the risk of re-offense. So, any tool that looks at "a bunch of factors" and comes up with "Black people are more likely to re-offend" is "biased" compared to our prior (even if our prior is wrong).

All evidence is "biased compared to our prior". That is what evidence is.

While this is a nice summary of classifier trade-offs, I think you are entirely too dismissive of the role of history in the dataset, and if I didn't know any better, I would walk away with the idea that fairness comes down to just choosing an optimal trade-off for a classifier. If you had read any of the technical response, you would have noticed that when controlling for "recidivism, criminal history, age and gender across races, black defendants were 45 percent more likely to get a higher score". Controls are important because they let you get at the underlying causal model, which is more important for predicting a person's recidivism than what statistical correlations will tell you. Choosing the right causal model is not an easy problem, but it is at the heart of what we mean when we conventionally talk about fairness.