Large sets of documents have been central to some of the biggest stories in journalism, from the Pentagon Papers to the Enron emails to the NSA files. But what, exactly, do journalists do with all these documents when they get them? In the course of developing Overview we’ve talked to a lot of journalists about their document mining problems, and we’ve discovered that there are several recurring types of document-driven story.
The smoking gun: searching for specific documents
In this type of work the journalist is trying to find a single document, or a small number of documents, that make the story. This might be the memo that proves corruption, the purchase order that shows money changed hands, or the key paragraph of an unsealed court transcript.
Jarrel Wade’s story about the Tulsa police department spending millions of dollars on squad car computers that didn’t work properly began with a tip. He knew there was an ongoing internal investigation, but little more until his FOIA request returned almost 7,000 pages of emails. His final story rested on a dozen key documents he discovered by using Overview to rapidly and systematically review every page.
This story demonstrates several recurring elements of smoking gun stories. Wade had a general idea of what the story was about, which is why he asked for the documents in the first place. But he didn’t know exactly what he was looking for, which makes text search difficult. Keyword search can also fail if there are many different ways to describe what you’re looking for, or if you need to look for a word that has several meanings: imagine searching for “can” meaning container, and finding that almost every document contains “can” meaning able. Even worse, OCR errors can prevent you from finding key documents, which is why Overview supports fuzzy search.
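To make the OCR problem concrete, here is a minimal sketch of approximate matching using Python’s standard difflib module. It is not how Overview implements fuzzy search; the function name fuzzy_find, the 0.75 cutoff, and the sample page text are all assumptions made up for illustration.

```python
import re
from difflib import SequenceMatcher


def fuzzy_find(term, text, cutoff=0.75):
    """Return (word, score) pairs for words in text that roughly match term.

    Hypothetical helper: the cutoff of 0.75 is an arbitrary assumption.
    """
    term = term.lower()
    hits = []
    for word in re.findall(r"\w+", text.lower()):
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in order.
        score = SequenceMatcher(None, term, word).ratio()
        if score >= cutoff:
            hits.append((word, round(score, 2)))
    return hits


# Invented example: OCR has turned the "o" in "order" into a zero,
# so an exact search for "order" would miss this page.
page = "Purchase 0rder 4471 was approved"
print(fuzzy_find("order", page))
```

A lower cutoff catches more OCR damage but also returns more unrelated words, so the threshold is a trade-off a reporter would tune by inspecting the hits.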
The trend story: getting the big picture
As opposed to a smoking gun story where only specific documents matter, a trend story is about broad patterns across many documents. A thorough analysis of a comprehensive set of documents makes a powerful argument.
For my story about private security contractors in Iraq I wanted to go beyond the few high-profile incidents that had made headlines during the height of the war. I used Overview to analyze 4,500 pages of recently declassified documents from the U.S. Department of State in order to understand the big picture questions. What were the day-to-day activities of armed private security contractors in Iraq? What kind of oversight did they have? Did contractors frequently injure civilians, or was it rare?
Overview showed the broad themes running across many documents in this unique collection of material. Combined with searches for specific types of incidents and a random sampling technique to back my claims with numbers, I was able to tell a carefully documented big picture story about this sensitive issue.
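The idea of backing a claim with numbers from a random sample can be sketched roughly as follows. The article does not spell out the exact procedure, so this is only an illustration under stated assumptions: the helper names draw_sample and estimate_share, the sample size, the document count, and the labels are all hypothetical, and the margin of error uses the simple normal approximation for a proportion.

```python
import math
import random


def draw_sample(doc_ids, k, seed=0):
    """Pick k documents uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(list(doc_ids), k)


def estimate_share(labels, z=1.96):
    """Estimate the share of True labels and an approximate 95% margin.

    labels holds one boolean per sampled document, assigned by a person
    who read that document. Uses the normal approximation, which is
    rough for small samples or shares near 0 or 1.
    """
    n = len(labels)
    p = sum(labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# Invented numbers: 1,000 documents, 50 sampled, 10 judged relevant.
sample_ids = draw_sample(range(1000), 50)
labels = [True] * 10 + [False] * 40
share, margin = estimate_share(labels)
print(f"{share:.0%} of documents, plus or minus {margin:.0%}")
```

Reading a random sample by hand is slower than searching, but it lets a statement about "most documents" rest on a stated estimate rather than an impression.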
Categorize and count: turning documents into data
Some trend stories depend on hard numbers: Of 367 children in foster care, 213 were abused. Ninety-two percent of the complaints concerned noise. The state legislature has never once proposed a bill to address this problem. This type of story involves categorizing every document according to some scheme. Both the categories you decide to use and the number of documents in each category can be important parts of the story.
For their comprehensive report on America’s underground market for adopted children, Ryan McNeill, Robin Respaut and Megan Twohey of Reuters analyzed more than 5,000 messages from a Yahoo discussion group spanning a five-year period. They created an infographic summarizing their discoveries: 261 different children were advertised over that time, from 34 different states. Over 70% of these children had been born abroad, in at least 23 different countries. They also documented the number of cases where children were described as having behavioral disorders or being victims of physical or sexual abuse. When combined with narratives of specific cases, these figures tell a powerful story about the nature and scope of the issue.
Overview’s tagging interface is well suited to categorize-and-count tasks. Even better, it can take much of the drudgery out of this type of work because similar documents are automatically grouped together. When you’re done you can export your tags to produce visualizations. We are planning to add machine learning techniques to Overview so that you can teach the system how you want your documents tagged.
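Once tags are exported, the counting step is simple. The sketch below assumes a hypothetical CSV layout with one row per document and a "tags" column holding comma-separated tag names; the column names, the count_tags helper, and the sample rows are illustrative assumptions, not Overview’s documented export format.

```python
import csv
import io
from collections import Counter


def count_tags(csv_file, tag_column="tags", separator=","):
    """Count how many documents carry each tag.

    Assumes one row per document and a column of separator-joined tags.
    Returns (Counter of tag -> document count, total documents).
    """
    counts = Counter()
    total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        # A set ensures a tag repeated on one document is counted once.
        tags = {t.strip() for t in row[tag_column].split(separator) if t.strip()}
        counts.update(tags)
    return counts, total


# Invented sample data standing in for an exported file.
sample = io.StringIO(
    "id,tags\n"
    '1,"noise,parking"\n'
    "2,noise\n"
    "3,\n"
)
counts, total = count_tags(sample)
for tag, n in counts.most_common():
    print(f"{tag}: {n} of {total} documents ({n / total:.0%})")
```

Reporting each count against the total number of documents, including untagged ones, keeps the percentages honest when some documents fit no category.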
Wheat vs. chaff: filtering out irrelevant material
Sometimes a journalist gets hold of a lot of potentially good material, but you can’t publish potential. Some fraction of the documents is interesting, but before reporters can report, they have to find that fraction.
In the summer of 2012 Associated Press reporter Jack Gillum used FOIA requests to obtain over 9,000 pages of then-VP nominee Paul Ryan’s correspondence with over 200 federal agencies. He used Overview to find the material that Ryan had written himself, as opposed to the thousands of pages of attachments and other supporting documents. By analyzing Ryan’s letters, Gillum was able to show that the Congressman was privately requesting money from many of the same government programs he had publicly criticized as wasteful.
This type of analysis is somewhere between the smoking gun and trend stories: first find the interesting subset of documents, then report on the trends across that subset. Overview was able to sort all of Ryan’s correspondence into a small number of folders because it recognized that the boilerplate language on his letterhead was shared across many documents.
What do you need to do with your documents?
These are the patterns we’ve seen, but we’ve also discovered that there are huge variations. Every story has unique aspects and challenges. Have you done an interesting type of journalistic document set analysis? Do you have a problem that you’re hoping Overview can solve? Contact us.