The document-mining Pulitzers

Four of the winners and finalists for the 2014 Pulitzer Prizes in journalism were stories based on reporting from large volumes of documents. Journalists have relied on bulk documents since long before computers — as in the groundbreaking work of I. F. Stone — but document-driven reporting has blown up over the last few years, driven by a confluence of open data, better technology, and the era of big leaks.

What I find fascinating about these stories is that they show such different uses and workflows for document mining in journalism. Several of them suggest directions Overview should take.

Overview used for Pulitzer finalist story

We were delighted to learn that Overview was used in reporting the sole finalist in the Public Service category, an investigation into New York state secrecy laws that hide police misconduct. Adam Playford and Sandra Peddie of Newsday reported the series over many months, and Playford wrote about his use of Overview for Investigative Reporters and Editors:

We knew early in our investigation of Long Island police misconduct that police officers had committed dozens of disturbing offenses, ranging from cops who shot unarmed people to those who lied to frame the innocent. We also knew that New York state has some of the weakest oversight in the country.

What we didn’t know was if anyone had ever tried to change that. We suspected that the legislature, which reaps millions in contributions from law enforcement unions, hadn’t passed an attempt to rein in cops in years. But we needed to know for sure, and missing even one bill could change the story drastically.

Luckily, I’d been playing with Overview, a Knight Foundation-funded Associated Press project that highlights patterns within piles of documents. Overview simplified my task greatly — letting me do days’ worth of work in a few hours.

Almost instantly, Overview scanned the full text of all 1,700 bills and created a visualization that split the bills into dozens of groups based on the most unique words that appeared in each bill. This gave me an easy way to skim through the bills in each group by title.

Ultimately, Playford and Peddie were able to prove that lawmakers had never addressed the problem. This is a tremendous story, exactly the sort of classic accountability reporting that journalism is supposed to be about.

The document mining process here is particularly interesting because the reporters needed to prove that something was not in the documents, a task we never envisioned when we began building Overview. But because Overview clusters similar documents, the reporters could often discard obviously unrelated clusters, which made an exhaustive review much faster.
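Overview’s own clustering builds a tree of groups from word frequencies, but the underlying idea is easy to illustrate. Below is a minimal sketch of the same family of technique — TF-IDF vectors grouped with k-means — in Python with scikit-learn; the toy bill titles and the fixed cluster count are my inventions, not Newsday’s data.

```python
# Toy sketch of TF-IDF + k-means document clustering, the family of
# technique Overview belongs to. Overview's real implementation builds
# a tree of clusters; the "bills" here are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

bills = [
    "An act to amend the vehicle and traffic law in relation to speed limits",
    "An act to amend the vehicle and traffic law in relation to school zones",
    "An act establishing a commission to review police disciplinary records",
    "An act in relation to oversight of law enforcement misconduct",
    "An act to amend the tax law in relation to property assessments",
]

# Weight each word by how distinctive it is within the collection.
vectors = TfidfVectorizer(stop_words="english").fit_transform(bills)

# Group bills that share distinctive vocabulary.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for label, title in sorted(zip(labels, bills)):
    print(label, title)
```

Skimming output like this, a reporter can dismiss the traffic and tax clusters at a glance and read only the police-oversight cluster closely, which is essentially the review strategy Playford describes.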

The Snowden files: software is an obvious win

The winner in the Public Service category was a blockbuster: The Guardian and The Washington Post won for their ongoing coverage of the NSA documents. Snowden has never publicly said exactly how many documents he took with him, but The Guardian says it received 58,000 at one point. This story typifies a new breed of document-driven reporting that has emerged in the last few years: a large leak, already in electronic form, with very little guidance about what the stories might be.

Finding what you don’t know to look for is an especially tricky problem with such unasked-for archives — as opposed to the results of a large FOIA request, where the journalist knows why they asked for the documents. Clearly you need software of some sort to report on this type of material.

The Al-Qaida papers: we need a workflow breakthrough

Rukmini Callimachi of the Associated Press was a finalist for her stories based on a cache of Al-Qaida documents found in the remote town of Timbuktu, Mali. Thousands of documents were found strewn through a building that the fighters had occupied for more than a year. Unlike the digital Snowden files, these were paper materials, several trash bags full of them, which had to be painstakingly collected, cataloged, scanned, and translated.

Through these documents we’ve learned of Al-Qaida’s shifting strategy in Africa, their tips for avoiding drones, and that members must file expense reports.

We had a chance to talk through the reporting process with Callimachi, and we’ve come to the conclusion that the bottleneck in such “random pile of paper” stories is the preservation, scanning, and translation process. The reporters on the equally remarkable Yanuleaks documents (thrown into a lake by ousted Ukrainian president Viktor Yanukovych) face the same challenges. I still believe in the power of good software to accelerate these types of stories, but the breakthrough we need is in workflow and process, not language-analysis algorithms. Could we integrate scanning and translation into a web app? Maybe using a phone scanning stand, and a combination of computer and crowdsourced translation?
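To make that idea concrete, the scanning half of such a pipeline is already scriptable. Here is a rough Python sketch assuming the Tesseract OCR engine with its Arabic language pack (driven through the pytesseract package) and a hypothetical directory of page scans; OCR on damaged, handwritten material is itself a hard problem, so treat this as the skeleton of a workflow rather than a solution.

```python
# Rough sketch of the "scan, then queue for translation" idea.
# Assumes Tesseract OCR (with the Arabic language pack) installed and
# driven via pytesseract; the translation field is a stub where machine
# or crowdsourced translation would plug in.
from pathlib import Path

from PIL import Image
import pytesseract

def ocr_scans(scan_dir: str, lang: str = "ara") -> list[dict]:
    """OCR every page image in scan_dir into a translation work queue."""
    pages = []
    for path in sorted(Path(scan_dir).glob("*.png")):
        text = pytesseract.image_to_string(Image.open(path), lang=lang)
        pages.append({"file": path.name, "text": text, "translation": None})
    return pages

# Each entry becomes a work item: a first machine-translation pass,
# then crowdsourced correction, before reporters ever see the text.
work_queue = ocr_scans("timbuktu_scans")
```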

America’s underground adoption market: counting cases

The final document-driven Pulitzer finalist covered a black market for adopted children. Megan Twohey of Reuters analyzed 5,000 posts from a newsgroup where parents traded children adopted from abroad.

I’ve started calling this kind of analysis a categorize-and-count story. Reuters reporters created an infographic to accompany the story, summarizing their findings: 261 different children were advertised over the period studied, from 34 different states. Over 70% of these children had been born abroad, in at least 23 different countries. The reporters also documented the number of cases in which children were described as having behavioral disorders or being victims of physical or sexual abuse.

When combined with narratives of specific cases, these figures tell a powerful story about the nature and scope of the issue.

5,000 posts is a lot to review manually, but that’s exactly what Twohey and her collaborators did. Two reporters independently read each post and recorded the details in a database, including the name and age of the child, the email address of the post author, and other information. After a lengthy cleaning process they were able to piece together the story of each child, sometimes across many years and different poster pseudonyms.

There may be an opportunity here for information extraction or machine learning algorithms: given a post, extract details such as the child’s name, and try to determine whether the text mentions other information, such as previous abuse.
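Even a simple classifier could triage such a collection. The sketch below uses scikit-learn’s TF-IDF features with logistic regression; the posts and labels are invented stand-ins for the reporters’ hand-coded database, and a real model would need far more training examples.

```python
# Sketch of semi-automated classification for a categorize-and-count
# story: score posts by how likely they are to mention abuse, trained
# on hand-coded examples. All posts and labels below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_posts = [
    ("history of physical abuse, acts out violently at school", 1),
    ("she was abused in the orphanage and struggles to trust adults", 1),
    ("sweet nine-year-old boy, loves soccer, adopted from overseas", 0),
    ("loving girl, good student, needs a family with more time", 0),
]
texts, labels = zip(*labeled_posts)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Reporters review the highest-scoring posts first; the model triages,
# humans still verify every count.
print(model.predict_proba(["past abuse mentioned, needs a new home"]))
```

The point is triage rather than automation: a model like this ranks posts for human review, it does not replace the two-reader verification Reuters used.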

But no one has really tried applying machine learning to journalism in this way. We hope to add semi-automated document classification features to Overview later this year, because it’s a problem that we see reporters struggling with again and again.

We’ll see more of this

I’m going to end with a guess: we’ll see several document-driven stories in next year’s Pulitzers, because we’ll see many more such stories in journalism in general. I make this prediction by looking at several long-term trends. Broadly speaking, the amount of data in the world is rapidly increasing. Open data initiatives offer a particular trove for journalists because the material is often politically relevant, and governments across the world are getting better at responding to Freedom of Information requests (even in China). At the same time, we’ve entered the era of the mega-leak: an entire country’s state secrets can now fit on a USB drive.

Much of this flood is unstructured data: words instead of numbers. Some technologists argue that we can and should eventually turn all of this into carefully coded databases of events and assertions. While such structuring efforts can be very valuable, we can expect that unstructured text data will always be with us because of its unique flexibility. Emails, instant messages, open-ended survey questions, books: ordinary human language is the only format that seems capable of expressing the complete range of human experience, and that is what journalism is ultimately about.

Meanwhile the technology for reporting on bulk unstructured material is improving rapidly. Overview is a part of that, and we’re aiming specifically at users in journalism and other traditionally under-resourced social fields. Between greater data availability and better tools, I have to imagine that we’ll see a lot more document-driven journalism in the future. To my eye, this year’s Pulitzers are a reflection of larger trends.

[Updated 2014-5-6 with a more accurate description of the reporting process for the Reuters story.]