Document mining shows Paul Ryan relying on the programs he criticizes

One of the jobs of a journalist is to check the record. When Congressman Paul Ryan became a vice-presidential candidate, Associated Press reporter Jack Gillum decided to examine the candidate through his own words. Hundreds of Freedom of Information requests and 9,000 pages later, Gillum wrote a story showing that Ryan has asked for money from many of the same federal programs he has criticized as wasteful, including stimulus money and funding for alternative fuels.

This would have been much more difficult without special software for journalism. In this case Gillum relied on two tools: DocumentCloud to upload, OCR, and search the documents, and Overview to automatically sort the documents into topics and visualize the contents. Both projects are previous Knight News Challenge winners.

But first Gillum had to get the documents. As a member of Congress, Ryan isn’t subject to the Freedom of Information Act. Instead, Gillum went to every federal agency — whose files are covered under FOIA — for copies of letters or emails that might identify Ryan’s favored causes, names of any constituents who sought favors, and more.

Bit by bit, the documents arrived on paper. The stack grew over weeks, eventually piling up two feet high on Gillum’s desk. Then he scanned the pages and loaded them into the AP’s internal installation of DocumentCloud. The software converts the scanned pages to searchable text, but that still left 9,000 pages of material to get through.

That’s where Overview came in. Developed in-house at the Associated Press, this open-source visualization tool processes the full text of each document and clusters similar documents together, producing a visualization that graphically shows the contents of the complete document set.
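The idea of clustering documents by their full text can be sketched in a few lines of Python. This is a minimal illustration, not Overview's actual implementation: it assumes TF-IDF word weighting, cosine similarity, and a simple greedy single-pass grouping, and the sample letters, the function names, and the 0.3 threshold are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document using smoothed TF-IDF."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            word: count * (math.log((1 + n) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Invented sample letters, for illustration only.
letters = [
    "Dear Congressman, I am writing to complain about the FCC and my cable bill.",
    "The FCC has ignored my complaint about my cable bill, Congressman.",
    "I request stimulus funding for an energy project in Wisconsin.",
    "Please support stimulus funding for the Wisconsin energy project.",
]
print(cluster(letters))
```

With these four letters, the two FCC complaints end up in one group and the two funding requests in another, which is the kind of sorting that lets a reporter set a whole pile aside at once.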

“I used Overview to take these 9,000 pages of documents, and knowing there was probably going to be a lot of garbage or extra attachments, to separate the chaff from the wheat,” said Gillum. Much of Ryan’s correspondence is standard congressional work, communicating with constituents about their particular problems and issues. “I could figure out where are the letters from voters, and to put these documents in groups. So if someone’s complaining about the FCC, and there are 200 pages about that, we can put that aside.”

DocumentCloud supports keyword search, but search won’t always tell the full story. First, much of the material, such as copies of faxed letters, was of such low quality that the OCR process that converts a scanned image into searchable text often produced incorrect results. This means that a literal search will miss documents. Second, searching will not help you find stories that you don’t know you are looking for, a problem that gets worse as the number of documents grows. You need something like a table of contents to avoid that problem, which is what Overview provides.
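To see why a literal search misses garbled pages, consider a small sketch using Python's standard `difflib` module. It is not how DocumentCloud or Overview works; it only illustrates the OCR problem. The sample pages, the `fuzzy_find` helper, and the 0.8 cutoff are assumptions made up for this example.

```python
import difflib
import re


def literal_find(query, pages):
    """Return indices of pages that contain the query string exactly."""
    return [i for i, text in enumerate(pages) if query.lower() in text.lower()]


def fuzzy_find(query, pages, cutoff=0.8):
    """Return indices of pages containing any word that closely
    resembles the query, tolerating a few OCR character errors."""
    hits = []
    for i, text in enumerate(pages):
        words = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
        if difflib.get_close_matches(query.lower(), words, n=1, cutoff=cutoff):
            hits.append(i)
    return hits


# Invented sample pages; on page 0 the OCR has read "m" as "rn".
pages = [
    "Request for stirnulus funds for the county",
    "Complaint about the FCC licence renewal",
    "Thank you for supporting stimulus grants",
]
print(literal_find("stimulus", pages))
print(fuzzy_find("stimulus", pages))
```

The literal search finds only the cleanly scanned page, while the approximate match also catches the garbled one. Approximate matching still requires knowing which word to look for, so it does not address the second problem of stories nobody thought to search for.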

In this case, Overview was able to group letters signed by Ryan by recognizing certain standardized language in the header and footer, even when that text was sometimes garbled by the OCR process. “It found a cluster of the documents that Ryan had written over his signature,” said Gillum.

Tools like DocumentCloud and Overview are rapidly becoming essential as reporters are forced to deal with ever-increasing amounts of information. It is not unusual for a single request for government files to produce thousands or even tens of thousands of pages of material, far too much to read exhaustively.

“Using these sorts of tools is essential as we go forward, looking at big document sets, to provide readers with some insight into how government works,” said Gillum.

“I’m not going to sit out on the newsroom floor and sort pages into stacks of documents,” he said.