Analysis of Police shootings data from Washington Post
Background
The purpose of this analysis is to verify the findings in the article Fatal Force posted on the Washington Post website: sources of data, analysis and conclusions.
The article claims that “Black Americans are killed at a much higher rate than white Americans”. I am going to test whether or not this statement is supported by the numbers.
Methods
For this analysis I will use the data the article points to or, where the source was not named, the best comparable source I could find. The analysis was performed using Jupyter Notebook server 6.0.3 running Python 3.7.6 (default, Jan 8 2020, 13:42:34), with an R kernel, version 3.6.1 (2019-07-05).
Results
First of all, let’s load the data used in the article straight from the author’s GitHub repository and do some basic review:
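# stringsAsFactors = TRUE keeps text columns (including race) as factors,
# which the summaries below rely on; the default changed to FALSE in R 4.0
a = read.csv(url("https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv"), stringsAsFactors = TRUE)
a$date = as.Date(as.character(a$date))  # convert the date column to Date
head(a)
nrow(a)         # number of records
range(a$date)   # first and last date covered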
As we see, there are a total of 5429 records on fatal police shootings, spanning years 2015 to 2020. The first summary to look at would be the review of the total numbers by race.
summary(a$race)
This doesn’t look very informative on its own. I need to adjust the symbols in the “race” column to be human readable. I will also “fold” the numbers of Asians and Native Americans into “Other” for simplicity, just like in the article.
levels(a$race) = c("Unknown","Other","Black","Hispanic","Other","Other","White")
a$race = factor(a$race, levels=c("Black","Hispanic","White","Other","Unknown"))
summary(a$race)
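The recoding above relies on two behaviours of R factors: levels are stored in sorted order (the empty string first), and assigning the same label to several levels merges them. A minimal sketch on a toy vector follows; the values are made up for illustration, and the single-letter codes are an assumption about the symbols used in the dataset, inferred from the label order above.

```r
# Toy stand-in for the race column. The values are made up, and the
# single-letter codes are an assumption inferred from the recoding above.
toy <- factor(c("W", "B", "", "A", "N", "H", "O", "W", "B"))

# Levels come out sorted, with the empty string first.
levels(toy)

# Assigning a repeated label ("Other") merges the matching levels.
levels(toy) <- c("Unknown", "Other", "Black", "Hispanic", "Other", "Other", "White")

# Reorder the merged levels for display, as done in the analysis.
toy <- factor(toy, levels = c("Black", "Hispanic", "White", "Other", "Unknown"))
counts <- table(toy)
counts
```

Because the assignment is positional, it silently mislabels groups if the set of codes in the file ever changes, so it is worth checking `levels()` before recoding.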
There is an unnamed category in the race column, which the article omits. In my opinion, the “Unknown” group is of considerable size and should not be ignored; for the analysis of Black vs White Americans, however, it really doesn’t matter.
Next, to continue the analysis we need to get population size. The article does not mention the source of this data – a minor flaw. After all, the article is not a formal report but a data-based editorial, and it doesn’t have to name all its sources. Nevertheless, if all sources were named it would help in reproducing the analysis…
A quick Google search took me to the “Kaiser Family Foundation” website, a portal on public health information that has fairly detailed population data for the years 2008-2018. The 2018 data are accessible under this link: https://www.kff.org/other/state-indicator/distribution-by-raceethnicity/?dataView=1&currentTimeframe=0&sortModel=%7B"colId":"Location","sort":"asc"%7D.
I was not able to download this data directly, so I had to manually copy/paste the numbers. I only used years 2015-2018, as these overlap with the fatal force data. By the way, I mapped the KFF category “two or more races” to the Unknown category.
b = data.frame("Year"=2018, "Black"=38655700, "Hispanic"=58483600,"White"=192117000,"Other"=2082800+17844800+519500,"Unknown"=8795000)
b = rbind(b, c(2017,38408000,57560600,192336100,2039400+17651200+502500,8524700))
b = rbind(b, c(2016,38081700,56144400,192537500,2041500+17004500+514600,8142200))
b = rbind(b, c(2015,37923800,55241300,192593600,1986900+16755300+466400,7810600))
b
It is now time to make the first calculation: to get the rates. I therefore divide the totals from the Washington Post data by the population numbers. Doing that I hit a minor problem – which year to select for population? This was not made clear in the article.
Another minor point: the total numbers of killings are accumulated over 5 years, so the calculated rates are 5-year rates, not the annual rates that most people would naturally expect.
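To put a cumulative rate on an annual footing, it can be divided by the length of the observation window. The helper below is a hypothetical sketch, not part of the original analysis; the input rate and the dates are illustrative assumptions, not results.

```r
# Hypothetical helper (not part of the original analysis): convert a rate
# accumulated over an observation window into an average annual rate.
annualize <- function(cumulative_rate, start, end) {
  years <- as.numeric(difftime(end, start, units = "days")) / 365.25
  cumulative_rate / years
}

# Illustrative inputs only: a cumulative rate of 30 per million over a
# window of five calendar years (both values are assumptions).
annual <- annualize(30, as.Date("2015-01-01"), as.Date("2020-01-01"))
annual
```

In the notebook, the window could be taken from `range(a$date)` instead of the assumed dates.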
summary(a$race)/b[b$Year==2018,c(2:6)] *1000000
So, per million of total population in 2018, the 5-year rates I calculated are similar to those in the article for Blacks and Whites, though very different for Hispanics and Other races. That may be due to using a different source of population data, or a different year. In any case, Blacks and Whites are the focus of this analysis, so let’s just continue.
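How much the choice of population year matters for the two groups of interest can be checked directly from the KFF figures copied above. This is a small sketch that reuses only the 2015 and 2018 numbers from the population table; the names `pop` and `ratio` are mine.

```r
# Population figures copied from the table built above (KFF, 2015 and 2018).
pop <- data.frame(
  Year  = c(2015, 2018),
  Black = c(37923800, 38655700),
  White = c(192593600, 192117000)
)

# Ratio of 2018 to 2015 population per group. For a fixed number of killings,
# a rate computed with the 2015 denominator differs from the 2018-based rate
# by exactly this factor.
ratio <- unlist(pop[pop$Year == 2018, c("Black", "White")]) /
         unlist(pop[pop$Year == 2015, c("Black", "White")])
round(ratio, 4)
```

For these two groups the denominators move by roughly 2% or less between 2015 and 2018, so the choice of year has only a small effect on their rates.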
Now would be a good time to make some plots.
z = aggregate(a$id, list("Race"=a$race, "Year"=format(a$date, "%Y")), FUN=length)
boxplot(z$x~z$Race, range=0, xlab="", ylab="Number of people killed in a year", main="Raw numbers of people killed by police")
barplot(height=unlist(summary(a$race)/b[b$Year==2018,c(2:6)] *1000000), width=unlist(b[b$Year==2018,c(2:6)]), space=0.05,
xlab="U.S. population", ylab="Rate of police killings per million of population")
Now I see that the rate for the Unknown group is exceptionally high, but I don’t really know much about that group. Perhaps it was a good idea to ignore it after all. Let’s do that.
a = a[a$race!="Unknown",]
a$race = factor(a$race)
summary(a$race)
# widths now use the same four columns as the heights (Unknown was dropped)
barplot(height=unlist(summary(a$race)/b[b$Year==2018,c(2:5)] *1000000), width=unlist(b[b$Year==2018,c(2:5)]), space=0.05,
xlab="U.S. population", ylab="Rate of police killings per million of population", col=c("red","gray","gray","gray"))
And now this plot more or less resembles the plot found in the article, strongly suggesting that Black Americans are being killed by the police at a much higher rate than other races. Such was the article’s conclusion.
There is one minor methodological issue that bothers me in this plot: the population is included in both the x and y axes. Remember that the rate (y-axis) is total-of-killings/population, and then we have population again on the x-axis. However, in the end it is the height of the bars that the human eye notices immediately, and that’s where the point is located, so no big deal about that.
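The overlap between the two axes has a simple consequence that can be written out. In the notation below (mine, not the article’s), $k_g$ is the number of killings recorded for group $g$ and $p_g$ is that group’s population; the bar for group $g$ has height $h_g$ and width $w_g$:

```latex
h_g = \frac{k_g}{p_g} \times 10^{6}, \qquad w_g = p_g
\quad\Longrightarrow\quad
h_g \, w_g = 10^{6} \, k_g
```

So each bar’s area is proportional to the raw number of killings in that group, while its height carries the rate.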
Discussion
The information shown in the plot is obviously mathematically correct, but is it conveying the message we think it is? The situation displayed by the above plot, with all the underlying assumptions, can be interpreted by this scenario:
“A policeman goes out to a random neighbourhood, pulls out a gun and shoots at random people. On average, per million of 2018 population, over 5 years, he kills 34 black people and 13 white people”.
Do you think this is real? The issues with this scenario are the following:
- In reality, policemen do not just go out to random neighbourhoods – they go more frequently to areas where there’s more crime, and less frequently to areas where there’s less crime.
- In reality, policemen do not just draw guns without a reason. At least a suspicion of a crime or a dangerous situation requiring the use of force usually exists before the gun is pulled.
- In reality, policemen do not shoot just anywhere, killing people at random. They aim and shoot at people that they believe are endangering someone.
I think these three issues can be folded into a single term: crime rate. For the random-shooting scenario to describe reality, the number of high-crime neighbourhoods and the number of violent crimes would have to be independent of race. The message from the Washington Post article would therefore be correct under the assumption that crime rates are identical across races. Are they? Let’s verify.
Enter crime rates
A quick Google search took me to the website of the US Department of Justice, Bureau of Justice Statistics. The Bureau publishes annual reports titled “Criminal Victimization”. These reports, however, contain a lot more information than suggested by the title.
In any case, the report for 2018 is located here: https://www.bjs.gov/content/pub/pdf/cv18.pdf. I am using 2018 as it is the latest available year, and it happens to be the same year as the population info from KFF.org used above, so we will have a good comparison. There, in Table 12, I found the numbers of violent crimes by race, which I believe is the best measure of situations in which police officers would draw their guns and shoot. Of course, I took the numbers of offenders, not victims.
Again, there is no programmatic access to this data (that I know of), so I just copy/pasted the numbers from the report:
offenses = c("Black"=1155670, "Hispanic"=767560, "White"=2669900, "Other"=131120+480290+115800)
offenses
killings = summary(a$race)
killings = killings[c("Black","Hispanic","White","Other")]
killings
Calculation of rate of killings adjusted for crime rate
Now, since we have the numbers of people killed (over 5 years) as well as the estimated numbers of violent offenders in 2018, we can calculate the rate of police killings adjusted for crime rate. I multiply the ratio by 100,000 to express it per 100,000 offenders.
round((killings/offenses)*100000,1)
barplot((killings/offenses)*100000, col=c("gray","red","gray","gray"), ylab="Police killings per 100,000 offenders", space=0.05)
The way to interpret these numbers is this: on average, across the USA, for the years 2015-2020, for every 100,000 violent offenders, 112 Black people, 118 Hispanic people and 93 White people were killed by the police. If anyone wanted to insist that the police are targeting a particular race, it would be Hispanics, not Blacks. But the differences between groups are small enough to say that the rates are actually quite similar.
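One way to make the comparison concrete is to express each result as a Black-to-White ratio under the two denominators. The sketch below uses only the rounded rates quoted in this report; the variable names are mine.

```r
# Rates as quoted in the text of this report (rounded values).
per_million_pop <- c(Black = 34,  White = 13)   # per million of 2018 population
per_100k_offend <- c(Black = 112, White = 93)   # per 100,000 violent offenders

# Black-to-White ratio under each denominator.
ratio_population <- per_million_pop[["Black"]] / per_million_pop[["White"]]
ratio_offenders  <- per_100k_offend[["Black"]] / per_100k_offend[["White"]]
round(c(population = ratio_population, offenders = ratio_offenders), 2)
```

With these rounded inputs the ratio falls from roughly 2.6 when dividing by population to roughly 1.2 when dividing by the number of violent offenders.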
Conclusion
What can I say? Once you take into account that the police do not just kill people at random, but rather those who are in trouble with the law in one way or another, it turns out that the police kill White Americans at a rate quite similar to that of Black Americans. This is according to the Washington Post data and the data obtained from the Dept of Justice.
Therefore, the original conclusion from the article, that “Black Americans are killed at a much higher rate than white Americans”, is false and unsupported by the numbers. Furthermore, the Washington Post analysis appears professionally done, therefore done by people with knowledge of how to do statistics properly, but also with knowledge of how to manipulate it. It is quite likely, in my opinion, that the analyzed article is a deliberate manipulation.
In the end, the final conclusion is this: Black Americans are NOT being killed at a higher rate than White Americans.
Final remarks
I will continue to look at this data from various angles – this report is just a very simple beginning. I will also look for other sources of data to check whether the Washington Post dataset is biased.
To be continued…