It’s often hard for people to understand what exactly I do. When I say that I’m a “data journalist” and that I “tell stories using data”, I usually get a blank stare or follow-up questions along the lines of, “what types of data?” or “what do you mean by ‘tell stories’?”
Here’s my attempt to answer the first question, “What types of data?”
I tend to group “data” into four broad categories:
Data that must be collected point by point
Information that would not exist if you didn’t go out and get it yourself. For example, there is no comprehensive list of people killed by police in the U.S. (FBI and CDC estimates are incomplete). The Guardian and the Washington Post set out to change that. The Guardian created a database of police killings. The Post compiled its own database of police shootings using a different accounting method.
Data that exists in some form but is not readily available, either because you have to ask for it or because it is private/secret
This could include information obtained through the Freedom of Information Act (FOIA). For example, the L.A. Times obtained a list of L.A. buildings that were overdue for a fire inspection by making a public records request to the LAFD.
It could also include data released through leaks/obtained from confidential sources. For example, the Panama Papers database, documents detailing offshore shell companies held with the firm Mossack Fonseca and obtained by the German newspaper Süddeutsche Zeitung through an anonymous source.
Data that you don’t have to request/get from anybody, but that must be extracted, cleaned up, merged, or otherwise formatted correctly for analysis
This category could easily overlap with the previous one, since raw data is rarely fit to analyze before some re-structuring. Data extraction happens a lot, too: the data behind the LAT’s graphic of every shot Kobe Bryant ever took was gathered by scraping information from the NBA’s webpage.
Also in this category are datasets that merge information from several sources in new, interesting ways. For example, Bloomberg’s piece, “This is How Fast America Changes Its Mind”, determined how long it took for state (and, in the case of the first five issues, federal) law to change on interracial marriage, prohibition, women’s suffrage, abortion, same-sex marriage, and recreational marijuana. Each issue used information from one or two different sources.
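To make the “clean up and merge” step a little more concrete, here is a minimal Python sketch. It is not how any of the newsrooms above actually did their work. The field names (`state`, `first_bill_year`, `legalized_year`), the helper functions, and the records are all made up for illustration. It only shows the general idea: tidy the key that two sources share, join on it, then compute something new from the combined rows.

```python
def normalize(name):
    """Collapse extra whitespace and unify case so keys from both sources match."""
    return " ".join(name.split()).title()


def merge_sources(source_a, source_b):
    """Join two lists of dicts on a cleaned 'state' key, keeping only matches."""
    lookup = {normalize(row["state"]): row for row in source_b}
    merged = []
    for row in source_a:
        key = normalize(row["state"])
        if key in lookup:
            merged.append({
                "state": key,
                "first_bill_year": int(row["first_bill_year"]),
                "legalized_year": int(lookup[key]["legalized_year"]),
            })
    return merged


# Hypothetical placeholder records, not real data.
source_a = [
    {"state": "  state a ", "first_bill_year": "1990"},
    {"state": "STATE B", "first_bill_year": "1995"},
]
source_b = [
    {"state": "State A", "legalized_year": "2004"},
    {"state": "state  b", "legalized_year": "2001"},
    {"state": "State C", "legalized_year": "2010"},
]

merged = merge_sources(source_a, source_b)
for row in merged:
    # The new, derived column that neither source had on its own.
    row["years_elapsed"] = row["legalized_year"] - row["first_bill_year"]
```

In this sketch, “State C” is dropped because it appears in only one source. In a real project, you would want to look at those unmatched rows rather than silently discard them.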
Data that is out there, waiting for you to analyze
This sounds like the unicorn category, but it’s not as uncommon as you’d think. Sometimes, news stories take an angle on publicly released datasets, like this one from FiveThirtyEight, which used the latest jobs report from the Bureau of Labor Statistics. Other times, somebody else has done the “creating the dataset” part for you. BuzzFeed did a story last year on race and police killings that used some of The Guardian and the Washington Post’s data.
If you’re lucky, someone else has done most of the “cleaning & wrangling” part, too. The California Civic Data Coalition, whose new site launched today, is a perfect example: this partnership of journalists and computer programmers has cleaned and documented more than 70,000 rows of California’s very messy campaign finance and lobbying data. This information is ripe for analysis.
These categories aren’t exhaustive or mutually exclusive, nor are they only applicable to news stories. For me, they’re a helpful starting point for getting into a data mindset.
Of course, it’s not just about the data; journalism is all about finding and telling the story. A lot of the time, that’s the trickiest part. I’ll write about that soon - stay tuned!