Why We Need to be Data Detectives

Heather Krause
Jun 7, 2019

In last week’s email newsletter I mentioned Naomi Wolf’s data problems. (That’s Naomi Wolf, not Naomi Klein as I accidentally wrote in part of the text. Apologies to Naomi Klein fans!) In Wolf’s latest book she makes claims about the number of men executed in England in the 1800s and cites data from the Old Bailey (one of the main courts at that time). In an interview with the BBC, Wolf was fact-checked live, and it turned out that her interpretation of the data was incorrect. She did not understand the meaning of the data she was looking at, and as a result her research and claims were unfounded. Here is the actual audio of the conversation; the most directly related portion starts around the 19-minute mark.

I wrote that if Naomi Wolf had built a data biography for her work, she could have avoided this problem. A really insightful colleague, Carolina Roe-Raymond from Rutgers University, wrote back that she didn’t think our data biography template would actually have prevented this problem. (Rutgers is also where Naomi Klein is the Gloria Steinem Chair in Media, Culture, and Feminist Studies, so I initially assumed it would be another correction of my typo.) We spent a lot of time talking and thinking about this during the week in our office, and we realized Dr. Roe-Raymond was correct. With many thanks to her, we’ve updated the Data Biography Template.

Misunderstanding or misinterpreting what the specific categories of a coded variable actually mean happens a lot. In Wolf’s case she thought that the category “death recorded” in the variable “Sentence” meant that the person had been executed. What it actually means is that the judge abstained from voicing a sentence of capital punishment because the judge believed that a royal pardon, which was very common at the time for many crimes, would be forthcoming if a proper death sentence were issued. It was a way to formally issue a death sentence without having to actually execute the person.

It does not feel outrageous that Naomi Wolf assumed that “death recorded” meant literally what it says. She didn’t really have a way to know that she was misinterpreting the data because the data seemed to be spelling out its meaning so literally. However, that’s exactly the point of a data biography — to check the assumptions you’re making about the data. I once worked on a project where I assumed the variable “Group Gender” meant the gender composition of the work groups. When we got on the ground, the gender compositions didn’t match at all. Turns out that variable reflected the enumerator’s opinion of who in the group was doing the bulk of the group’s work. So, for example, in the ‘Group Gender’ category ‘80% women’ did not mean that the group was composed of 80% women but rather that the women in the group were perceived to be doing 80% of the work.

My mistake and Dr. Wolf’s mistake stem from a lack of detailed understanding of the context of the variables we were using. Like bad data detectives, we accepted the first clue at face value. By constructing a detailed data biography we might have caught the mistake earlier. The variables “Sentence” and “Sentence Outcome” capture different things: “Sentence” is the official sentence handed down, while “Sentence Outcome” is the action actually taken. In the case of Thomas Silver, one of the records Dr. Wolf used in her research, the Sentence was death while the Sentence Outcome was imprisonment. He was imprisoned for three years rather than executed.
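The gap between the two variables can be made concrete with a small sketch. The records below are invented for illustration, apart from the Thomas Silver row, which mirrors the values described above. The field names `sentence` and `sentence_outcome` and the category strings are assumptions for this example, not the actual schema of the Old Bailey data.

```python
# Toy records. Only the Thomas Silver row reflects the article;
# the "Hypothetical" rows are made up to show the counting problem.
records = [
    {"name": "Thomas Silver", "sentence": "death", "sentence_outcome": "imprisonment"},
    {"name": "Hypothetical A", "sentence": "death", "sentence_outcome": "executed"},
    {"name": "Hypothetical B", "sentence": "death", "sentence_outcome": "imprisonment"},
    {"name": "Hypothetical C", "sentence": "imprisonment", "sentence_outcome": "imprisonment"},
]


def count_where(rows, field, value):
    """Count rows whose `field` equals `value`."""
    return sum(1 for row in rows if row[field] == value)


# Reading "Sentence" literally treats every death sentence as an execution.
naive_executions = count_where(records, "sentence", "death")

# "Sentence Outcome" records what actually happened to the person.
actual_executions = count_where(records, "sentence_outcome", "executed")

print(naive_executions)   # 3
print(actual_executions)  # 1
```

The same rows give very different answers depending on which variable is counted, which is why the meaning of each variable has to be checked before any totals are reported.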

Now, to the Data Biography Template. This is where Carolina’s insight has been invaluable. In our Data Biography Template, there is a section called “What” that covers what data is being collected. In the template, we spell out specific meanings that you need to check, such as how the data is being collected and what it means. For example, are the categories trans woman and cis woman being combined into a single category of woman? Or, in the violence against women example, we ask which categories of violence are included and lumped together. Is emotional violence, for example, considered to be violence in this data?

However, this section addressed variables only at the top level, and it works best for a simple variable with binary yes/no categories. For example, if ‘Elderly’ is a variable, the data biography might explain that the survey counts people aged 70+ as elderly. In a situation where a variable has many categories, we realized that we needed a place to contextualize each possible response. We’ve added a more detailed section to the Data Biography Template where we suggest documenting the specifics of the key variables used in your data product. Just as part of the data biography is like “thick metadata”, this new section is like a “thick codebook”.
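One way to picture a “thick codebook” entry is as a small data structure that holds, for each category of a variable, what it means and where that meaning came from. This is only a sketch: the class names, field names, and the source string are hypothetical and are not taken from the Data Biography Template itself.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryNote:
    label: str    # the category exactly as it appears in the data
    meaning: str  # what the category actually means
    source: str   # where that meaning was confirmed


@dataclass
class VariableEntry:
    name: str
    description: str
    categories: dict = field(default_factory=dict)

    def add_category(self, label, meaning, source):
        """Document one possible response for this variable."""
        self.categories[label] = CategoryNote(label, meaning, source)

    def meaning_of(self, label):
        """Look up the documented meaning of a category."""
        return self.categories[label].meaning


group_gender = VariableEntry(
    name="Group Gender",
    description="Enumerator's opinion of who does the bulk of the group's work",
)
group_gender.add_category(
    label="80% women",
    meaning="Women in the group are perceived to do 80% of the work",
    source="hypothetical: conversation with field enumerators",
)

print(group_gender.meaning_of("80% women"))
```

Keeping a `source` next to every `meaning` is the design choice that matters here: it forces the question of how you know what a category means.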

Here is a link to our updated Data Biography Template that now includes space for details about all the individual variables used in the data product. The short version of the template remains the same and can be found here.

As Carolina pointed out, if a category is confusingly titled or vaguely explained, just having a slot for information about it won’t automatically catch these mistakes. This tool is effective only if we never put in any ASSUMED information. Sure, you think ‘80% women’ means that 8 out of 10 group members are women, but sometimes it turns out to be one really hardworking woman. Getting the facts about what your variables mean will help you avoid unpleasant mistakes, and making it a best practice to always record that level of detail gives you a red flag system for catching omissions, errors, or outright falsehoods.
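The “no assumed information” rule can be turned into a simple check. In the sketch below, every codebook entry carries a `verified_by` field, and anything left empty is flagged for follow-up. The entry layout, the field names, and the example sources are all hypothetical, chosen only to illustrate the idea.

```python
# Each entry documents one category of one variable.
# `verified_by` stays None until someone confirms the meaning with a real source.
codebook = [
    {
        "variable": "Sentence",
        "category": "death recorded",
        "meaning": "the person was executed",  # an assumption, not yet checked
        "verified_by": None,
    },
    {
        "variable": "Elderly",
        "category": "yes",
        "meaning": "respondent is aged 70 or older",
        "verified_by": "survey documentation (hypothetical source)",
    },
]


def unverified_entries(entries):
    """Return entries whose meaning has no recorded source."""
    return [entry for entry in entries if not entry.get("verified_by")]


for entry in unverified_entries(codebook):
    print(f"CHECK: {entry['variable']} / {entry['category']} is still an assumption")
```

A check like this cannot tell you what a category means, but it does make every unconfirmed guess visible before the data is used.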

As always, we really love getting your feedback. It helps us make better tools for you and together we can help raise the levels of equity and ethics in all our data products.

Originally published at https://weallcount.com on June 7, 2019.

The We All Count project shares examples, builds tools, and provides training and education aimed at helping people better understand data, so we can make it more transparent and fairer for everyone. Because when you do the math, we all count.

Sign up for our newsletter, ‘The Lowdown’. It delivers plain-language, no-jargon, in-depth articles about current issues in data science equity; video tutorials on how to find and correct these errors; interactive web tools that can help you bring your data game to the next level; and a deep dive into the ‘Data Life Cycle’, our keystone concept for thinking about the future of data.

Heather Krause

Data scientist & statistician (one of only 150 accredited PStats worldwide). Providing data science services grounded in an equity lens. https://weallcount.com