Analysis of police shootings data from the Washington Post

Background

The purpose of this analysis is to verify the article Fatal Force, posted on the Washington Post website: its sources of data, its analysis and its conclusions.

The article claims that “Black Americans are killed at a much higher rate than white Americans”. I am going to test whether or not this statement is supported by the numbers.

Methods

For this analysis I will use the data the article points to or, where the source is not named, the best similar source I know of. The analysis is performed on Jupyter Notebook server 6.0.3 running Python 3.7.6 (default, Jan 8 2020, 13:42:34), with an R kernel, R version 3.6.1 (2019-07-05).

Results

First of all, let’s load the data used in the article straight from the author’s GitHub repository and do some basic review:

In [7]:
a = read.csv(url("https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv"))
a$date = as.Date(as.character(a$date))
head(a)
nrow(a)
range(a$date)
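A data.frame: 6 × 14

| | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <fct> | <date> | <fct> | <fct> | <int> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> |
| 1 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53 | M | A | Shelton | WA | True | attack | Not fleeing | False |
| 2 | 4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47 | M | W | Aloha | OR | False | attack | Not fleeing | False |
| 3 | 5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23 | M | H | Wichita | KS | False | other | Not fleeing | False |
| 4 | 8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32 | M | W | San Francisco | CA | True | attack | Not fleeing | False |
| 5 | 9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39 | M | H | Evans | CO | False | attack | Not fleeing | False |
| 6 | 11 | Kenneth Joe Brown | 2015-01-04 | shot | gun | 18 | M | W | Guthrie | OK | False | attack | Not fleeing | False |

5429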

As we see, there are a total of 5429 records on fatal police shootings, spanning years 2015 to 2020. The first summary to look at would be the review of the total numbers by race.

In [8]:
summary(a$race)
| (unnamed) | A | B | H | N | O | W |
|---|---|---|---|---|---|---|
| 528 | 94 | 1299 | 904 | 78 | 48 | 2478 |

This doesn’t look very informative on its own. I need to adjust the symbols in the “race” column to be human readable. I will also “fold” the numbers of Asians and Native Americans into “Other” for simplicity, just like in the article.

In [9]:
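# Levels are sorted alphabetically ("", A, B, H, N, O, W); relabel them in that order
levels(a$race) = c("Unknown","Other","Black","Hispanic","Other","Other","White")
# Reorder the levels for display
a$race = factor(a$race, levels=c("Black","Hispanic","White","Other","Unknown"))
summary(a$race)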
| Black | Hispanic | White | Other | Unknown |
|---|---|---|---|---|
| 1299 | 904 | 2478 | 220 | 528 |

There is an unnamed category in the race column, which the article omits. In my opinion the “Unknown” group is large enough that it should not be ignored; for the comparison of Black vs White Americans, however, it really doesn’t matter.
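The positional assignment to `levels()` above relies on R sorting the raw codes alphabetically. A minimal sketch of a more explicit alternative uses a named lookup vector. The `codes` vector below is a made-up sample for illustration, not the real data, and the code-to-label mapping simply mirrors the recoding used above.

```r
# Hypothetical sample of raw race codes, including the unnamed (empty) category
codes = c("W", "B", "", "H", "A", "N", "O", "W")

# Explicit mapping from raw code to label, mirroring the recoding above
lookup = c("A" = "Other", "B" = "Black", "H" = "Hispanic",
           "N" = "Other", "O" = "Other", "W" = "White")

# Codes that are not in the lookup (here the empty string) come back as NA
labels = unname(lookup[codes])
labels[is.na(labels)] = "Unknown"

race = factor(labels, levels = c("Black", "Hispanic", "White", "Other", "Unknown"))
table(race)
```

With a lookup like this, a new or reordered code in the source file ends up in “Unknown” instead of silently shifting every label by one position.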

Next, to continue the analysis, we need population sizes. The article does not mention the source of this data, which is a minor flaw. After all, the article is not a formal report but a data-based editorial, and it doesn’t have to name all its sources. Nevertheless, if all sources were named it would help in reproducing the analysis…

A quick Google search took me to the “Kaiser Family Foundation” (KFF) website, a portal for public health information that has fairly detailed population information for the years 2008-2018. The 2018 data are accessible under this link: https://www.kff.org/other/state-indicator/distribution-by-raceethnicity/?dataView=1&currentTimeframe=0&sortModel=%7B"colId":"Location","sort":"asc"%7D.

I was not able to download this data directly, so I had to copy and paste the numbers manually. I only used the years 2015-2018, as these overlap with the fatal force data. Note that I put the KFF category “two or more races” into my Unknown column.

In [10]:
b = data.frame("Year"=2018, "Black"=38655700, "Hispanic"=58483600,"White"=192117000,"Other"=2082800+17844800+519500,"Unknown"=8795000)
b = rbind(b, c(2017,38408000,57560600,192336100,2039400+17651200+502500,8524700))
b = rbind(b, c(2016,38081700,56144400,192537500,2041500+17004500+514600,8142200))
b = rbind(b, c(2015,37923800,55241300,192593600,1986900+16755300+466400,7810600))
b
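A data.frame: 4 × 6

| Year | Black | Hispanic | White | Other | Unknown |
|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 2018 | 38655700 | 58483600 | 192117000 | 20447100 | 8795000 |
| 2017 | 38408000 | 57560600 | 192336100 | 20193100 | 8524700 |
| 2016 | 38081700 | 56144400 | 192537500 | 19560600 | 8142200 |
| 2015 | 37923800 | 55241300 | 192593600 | 19208600 | 7810600 |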

It is now time to make the first calculation: to get the rates. I therefore divide the totals from the Washington Post data by the population numbers. Doing that I hit a minor problem – which year to select for population? This was not made clear in the article.

Another minor point: the total numbers of killings are accumulated over 5 years, so the calculated rates are 5-year rates, not the annual rates that most people would naturally expect.

In [11]:
summary(a$race)/b[b$Year==2018,c(2:6)] *1000000
A data.frame: 1 × 5

| | Black | Hispanic | White | Other | Unknown |
|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 33.60436 | 15.45732 | 12.89839 | 10.75947 | 60.03411 |

So, per million of total 2018 population, the 5-year rates I calculated are similar to those in the article for Black and White Americans, though very different for the Hispanic and Other groups. That may be due to a different source of population data, or a different year. In any case, Black and White Americans are the focus of this analysis, so let’s just continue.
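To see how much the two open choices matter (which population year to use, and cumulative vs annual rates), here is a rough sketch that recomputes the rates for every population year in the table above. The figures are copied from the earlier cells. The `years_covered` value of 5 is an assumption taken from the “5-year” approximation in the text, not an exact span of the data.

```r
# Killings by race, cumulative totals as tabulated above
killed = c(Black = 1299, Hispanic = 904, White = 2478, Other = 220, Unknown = 528)

# KFF population figures copied from the table above, one row per year
pop = rbind(
  "2015" = c(37923800, 55241300, 192593600, 19208600, 7810600),
  "2016" = c(38081700, 56144400, 192537500, 19560600, 8142200),
  "2017" = c(38408000, 57560600, 192336100, 20193100, 8524700),
  "2018" = c(38655700, 58483600, 192117000, 20447100, 8795000)
)
colnames(pop) = names(killed)

# Cumulative rate per million for each choice of population year:
# each column of 1e6/pop is multiplied by that group's number of killings
rates = sweep(1e6 / pop, 2, killed, `*`)
round(rates, 1)

# Assumption: the data cover roughly 5 years, so dividing gives an approximate annual rate
years_covered = 5
round(rates["2018", ] / years_covered, 1)
```

Dividing by a constant number of years rescales every group equally, so it changes the scale of the rates but not the comparison between groups.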

Now is a good time to make some plots.

In [12]:
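# Count killings per race and calendar year
z = aggregate(a$id, list("Race"=a$race, "Year"=format(a$date, "%Y")), FUN=length)

# Distribution of yearly counts per race; range=0 extends the whiskers to the extremes
boxplot(z$x~z$Race, range=0, xlab="", ylab="Number of people killed in a year", main="Raw numbers of people killed by police")

# Bar height = cumulative rate per million (2018 population); bar width = population size
barplot(height=unlist(summary(a$race)/b[b$Year==2018,c(2:6)] *1000000), width=unlist(b[b$Year==2018,c(2:6)]), space=0.05,
       xlab="U.S. population", ylab="Rate of police killings per million of population")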

Now I see that the rate for the Unknown group is exceptionally high, but I don’t really know much about that group. Perhaps it was a good idea to ignore it after all. Let’s do that.

In [13]:
a = a[a$race!="Unknown",]
a$race = factor(a$race)
summary(a$race)
| Black | Hispanic | White | Other |
|---|---|---|---|
| 1299 | 904 | 2478 | 220 |

In [14]:
# Unknown was dropped, so use columns 2:5 for both bar height and bar width
barplot(height=unlist(summary(a$race)/b[b$Year==2018,c(2:5)] *1000000), width=unlist(b[b$Year==2018,c(2:5)]), space=0.05,
       xlab="U.S. population", ylab="Rate of police killings per million of population", col=c("red","gray","gray","gray"))

Now this plot more or less resembles the plot found in the article, strongly suggesting that Black Americans are being killed by the police at a much higher rate than other races. Such was the article’s conclusion.

One minor methodological issue bothers me in this plot: the population appears on both the x and y axes. Remember that the rate (y-axis) is total killings divided by population, and then we have population again on the x-axis, as the bar width. In the end, however, it is the height of the bars that the human eye notices immediately, and that is where the point is made, so it is no big deal.
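One way to read the variable-width bars: since the height is killings divided by population and the width is population, the area of each bar is proportional to the raw number of killings. A short check of that arithmetic, using the counts and 2018 population figures from the cells above:

```r
killed  = c(Black = 1299, Hispanic = 904, White = 2478, Other = 220)
pop2018 = c(Black = 38655700, Hispanic = 58483600, White = 192117000, Other = 20447100)

height = killed / pop2018 * 1e6   # rate per million (y-axis)
width  = pop2018                  # population (bar width along the x-axis)

# Area = height * width; dividing by 1e6 recovers the raw counts
area = height * width / 1e6
all.equal(area, killed)
```

So the plot encodes the rates in the bar heights and, implicitly, the raw counts in the bar areas.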

Discussion

The information shown in the plot is mathematically correct, but does it convey the message we think it does? The situation displayed by the plot above, with all its underlying assumptions, corresponds to this scenario:

“A policeman goes out to a random neighbourhood, pulls out a gun and shoots at random people. On average, per million of 2018 population, over 5 years, he kills 34 black people and 13 white people”.

Do you think this is realistic? The issues with this scenario are the following:

  • In reality, policemen do not just go out to random neighbourhoods – they go more frequently to areas where there’s more crime, and less frequently to areas where there’s less crime.
  • In reality, policemen do not just draw guns without a reason. At least a suspicion of a crime, or a dangerous situation requiring the use of force, usually exists before a gun is drawn.
  • In reality, policemen do not shoot just anywhere, killing people at random. They aim and shoot at people that they believe are endangering someone.

I think these three issues can be folded into a single term: crime rate. For the random-shooting scenario to hold, the number of high-crime neighbourhoods and the number of violent crimes would have to be independent of race. The message from the Washington Post article would therefore be correct under the assumption that crime rates are identical across races. Are they? Let’s verify.

Enter crime rates

A quick Google search took me to the website of the U.S. Department of Justice, Bureau of Justice Statistics. The Bureau publishes annual reports titled “Criminal Victimization”. These reports, however, contain a lot more information than the title suggests.

In any case, the report for 2018 is located here: https://www.bjs.gov/content/pub/pdf/cv18.pdf. I am using 2018 as it is the latest available year, and it happens to be the same year as the population data from KFF.org used above, so we will have a good comparison. In Table 12 I found the numbers of violent crimes by race, which I believe is the best measure for situations in which police officers would draw their guns and shoot. Of course, I took the numbers of offenders, not victims.

Again, there is no programmatic access to this data (that I know of), so I just copy/pasted the numbers from the report:

In [15]:
offenses = c("Black"=1155670, "Hispanic"=767560, "White"=2669900, "Other"=131120+480290+115800)
offenses
| Black | Hispanic | White | Other |
|---|---|---|---|
| 1155670 | 767560 | 2669900 | 727210 |

In [16]:
killings = summary(a$race)
killings = killings[c("Black","Hispanic","White","Other")]
killings
| Black | Hispanic | White | Other |
|---|---|---|---|
| 1299 | 904 | 2478 | 220 |

Calculation of rate of killings adjusted for crime rate

Now, since we have the numbers of people killed (over 5 years) as well as the estimated numbers of violent offenders for the year 2018, we can calculate the corrected rate of police killings. I multiplied the ratios by 100,000 to express them per 100,000 offenders.

In [17]:
round((killings/offenses)*100000,1)
| Black | Hispanic | White | Other |
|---|---|---|---|
| 112.4 | 117.8 | 92.8 | 30.3 |

In [19]:
barplot((killings/offenses)*100000, col=c("gray","red","gray","gray"), ylab="Police killings per 100,000 offenders", space=0.05)

The way to interpret these numbers is this: on average, across the USA, for every 100,000 violent offenders (2018 estimate), 112 Black people, 118 Hispanic people and 93 White people were killed by the police over the years 2015-2020. If anyone wanted to insist that the police are targeting any race, it would be Hispanics, not Blacks. But the differences between groups are small enough to say that the rates are actually quite similar.
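To put a number on “quite similar”, one option is to express each group’s rate as a ratio to the rate for White Americans, once with population as the denominator and once with violent offenders. This is only a sketch using the figures from the cells above; the helper `ratio_to_white` is a hypothetical name introduced here for illustration.

```r
killed    = c(Black = 1299, Hispanic = 904, White = 2478, Other = 220)
pop2018   = c(Black = 38655700, Hispanic = 58483600, White = 192117000, Other = 20447100)
offenders = c(Black = 1155670, Hispanic = 767560, White = 2669900, Other = 727210)

# Hypothetical helper: each group's rate divided by the rate for White Americans
ratio_to_white = function(counts, denom) {
  rate = counts / denom
  rate / rate[["White"]]
}

round(rbind(
  per_population = ratio_to_white(killed, pop2018),
  per_offender   = ratio_to_white(killed, offenders)
), 2)
```

A ratio of 1 means the same rate as for White Americans, so the two rows show directly how the comparison shifts when the denominator changes from population to offenders.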

Conclusion

What can I say? Once you take into account that the police do not just kill people at random, but rather those who are in trouble with the law in one way or another, it turns out that the police kill White Americans at a rate quite similar to that of Black Americans. This is according to the Washington Post data and the data obtained from the Department of Justice.

Therefore, the original conclusion from the article, that “Black Americans are killed at a much higher rate than white Americans”, is false and unsupported by the numbers. Furthermore, the Washington Post analysis appears professionally done, that is, done by people who know how to do statistics properly but who also know how to manipulate them. It is quite likely, in my opinion, that the analyzed article is a deliberate manipulation.

In the end, the final conclusion is this: Black Americans are NOT being killed at a higher rate than White Americans.

Final remarks

I will continue to look at this data from various angles; this report is just a very simple beginning. I will also look for other sources of data to check whether the Washington Post dataset is biased.

To be continued…