Using Overview to analyze 4500 pages of documents on security contractors in Iraq

This post describes how we used a prototype of the Overview software to explore 4,500 pages of incident reports concerning the actions of private security contractors working for the U.S. State Department during the Iraq war. This was the core of the reporting work for our previous post, where we reported the results of that analysis.

The promise of a document set like this is that it will give us some idea of the broader picture, beyond the handful of really egregious incidents that have made headlines. To do this, we have to take into account most or all of the documents in some way, not just the small number that might match a particular keyword search. But at one page per minute, eight hours per day, it would take about 10 days for one person to read all of these documents, to say nothing of taking notes or doing any sort of followup. This is exactly the sort of problem that Overview is meant to solve.

The reporting was a multi-stage process:

  • Splitting the massive PDFs into individual documents and extracting the text
  • Exploration and subject tagging with the Overview prototype
  • Random sampling to estimate the frequency of certain types of events
  • Followup and comparison with other sources

Splitting the PDFs
We began with documents posted to DocumentCloud — 4,500 pages worth of declassified, redacted incident reports and supporting investigation records from the Bureau of Diplomatic Security. The raw material is in six huge PDF files, each covering a six-month range, and nearly a thousand pages long.

Overview visualizes the content of a set of “documents,” but there are hundreds of separate incident reports, emails, investigation summaries, and so on inside each of these large files. This problem of splitting an endless stack of paper into sensible pieces for analysis is a very common challenge in document set work, and there aren’t yet good tools for it. We tackled the problem using a set of custom scripts, but we believe many of the techniques will generalize to other cases.

The first step is extracting the text from each page. DocumentCloud already does text recognition (OCR) on every document uploaded, and the PDF files it gives you to download have the text embedded in them. We used DocumentCloud’s convenient docsplit utility to pull out the text of each page into a separate file, like so:

docsplit text --pages all -o textpages january-june-2005.pdf

This produces a series of files named january-june-2005_1.txt, january-june-2005_2.txt etc. inside the textpages directory. This recovered text is a mess, because these documents are just about the worst possible case for OCR: many of these documents are forms with a complex layout, and the pages have been photocopied multiple times, redacted, scribbled on, stamped and smudged. But large blocks of text come through pretty well, and this command extracts what text there is into one file per page.

The next step is combining pages into their original multi-page documents. We don’t yet have a general solution, but we were able to get good results with a small script that detects cover pages, and splits off a new document whenever it finds one. For example, many of the reports begin with a summary page that looks like this:

Our script detects this cover page by looking for “SENSITIVE BUT UNCLASSIFIED,” “BUREAU OF DIPLOMATIC SECURITY” and “Spot Report” on three different lines. Unfortunately, OCR errors mean that we can’t just use the normal string search operations, as we tend to get strings like “SENSITIZV BUT UNCLASSIEIED” and “BUR JUDF DIPLOJ>>TIC XECDRITY.” Also, these are reports typed by humans and don’t have a completely uniform format. The “Spot Report” line in particular occasionally says something completely different. So, we search for each string with a fuzzy matching algorithm, and require only two out of these three strings to match.
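A minimal sketch of this kind of fuzzy cover-page test is shown below. It is not the script used for the story: it assumes Python's standard difflib module as the fuzzy matcher, compares each phrase against whole lines of the page, and uses a made-up similarity threshold of 0.7. The function names are hypothetical, and unlike the description above it does not check that the matches fall on three different lines.

```python
from difflib import SequenceMatcher

# Marker phrases quoted in the post. The 0.7 threshold is an illustrative guess.
COVER_PHRASES = [
    "SENSITIVE BUT UNCLASSIFIED",
    "BUREAU OF DIPLOMATIC SECURITY",
    "SPOT REPORT",
]


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def phrase_found(phrase, lines, threshold=0.7):
    """True if any single line of the page is close enough to the phrase."""
    return any(similarity(phrase, line.strip()) >= threshold for line in lines)


def is_cover_page(page_text, phrases=COVER_PHRASES, required=2, threshold=0.7):
    """True if at least `required` of the marker phrases fuzzily match a line."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    hits = sum(phrase_found(phrase, lines, threshold) for phrase in phrases)
    return hits >= required
```

Requiring only two of three phrases keeps the test tolerant of a badly garbled line or of a “Spot Report” line that says something else, at the cost of a higher risk of false positives if the threshold is set too low.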

We found about 10 types of cover pages in the document set, each of which required a different set of strings and matching thresholds. But with this technique, we were able to automatically divide the pages into 666 distinct documents, most of which contain material concerning a single incident. It’s not perfect — sometimes cover pages are not detected correctly, or are entirely missing — but it’s good enough for our purposes.
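The splitting step itself can be sketched in a few lines. This is an illustration rather than the actual pre-processing script: `split_into_documents` and its `is_cover` argument are hypothetical names, and the cover-page test is passed in as a function so that any detector (or a combination of the roughly 10 cover types) can be plugged in.

```python
def split_into_documents(pages, is_cover):
    """Group page texts into documents, starting a new one at each cover page.

    `pages` is a list of page texts in their original order.
    `is_cover` is any function that returns True for a cover page.
    Pages that appear before the first detected cover form their own document.
    """
    documents = []
    for page in pages:
        if is_cover(page) or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents
```

A missed cover page simply merges two reports into one document, and a falsely detected one splits a report in two, which matches the kinds of imperfection described above.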

The pre-processing script writes the concatenated text for each extracted document into one big CSV file, one document per row. It also writes out the number of pages for that document, and a document URL formed by adding the page number to the end of a DocumentCloud URL. If you can get your document set into this sort of CSV input format, you can explore it with the Overview prototype.
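As a rough illustration of that output step, the sketch below writes one row per document using Python's csv module. The column names (`text`, `pages`, `url`) and the `#document/p<N>` URL suffix are assumptions made for this example, not the exact format the Overview prototype expects, and `write_documents_csv` is a hypothetical name.

```python
import csv


def write_documents_csv(path, documents, base_url):
    """Write one CSV row per document: full text, page count, and a page URL.

    `documents` is a list of documents, each a list of page texts in order.
    `base_url` is the URL of the source file; the suffix format is assumed.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "pages", "url"])
        first_page = 1  # page number, within the source PDF, where the document starts
        for pages in documents:
            text = "\n".join(pages)
            url = f"{base_url}#document/p{first_page}"
            writer.writerow([text, len(pages), url])
            first_page += len(pages)
```

Letting the csv module handle quoting matters here, because the concatenated text of a document contains newlines, commas and quotation marks.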

Exploring the documents with Overview

The Overview prototype comes in two parts: a set of Ruby scripts that do the natural language processing, and a document set exploration GUI that runs as a desktop Java app. Starting from iraq-contractor-incidents.csv, we run the preprocessing and launch the app like this:

./ iraq-contractor-incidents

./ iraq-contractor-incidents

Overview has advanced quite a bit since the proof-of-concept visualization work last year, and we now have a prototype tool set with a document set exploration GUI that looks like this:

Top right is the “items plot,” which is an expanded version of the prototype “topic maps” that we demonstrated in our earlier work visualizing the War Logs. Each document is a dot, and similar documents cluster together. The positions of the dots are abstract and don’t correspond to geography or time. Rather, the computer tries to put documents about similar topics close together, producing clusters. It determines “topic” by analyzing which words appear in the text, and how often.
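To make the idea of word-based similarity concrete, here is a small sketch that compares two documents by their word counts using cosine similarity. This is an assumption for illustration only: the exact weighting and similarity metric that the prototype uses are not described here, and real systems usually down-weight very common words rather than using raw counts.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

Two reports that both mention a vehicle and a motorcade score higher against each other than either does against a report about a helicopter, which is the kind of signal that pulls similar documents into the same cluster.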

Top left is the “topic tree,” our new visualization of the same documents. It’s based on the same similarity metric as the items plot, but here the documents are divided into clusters and sub-clusters.

The computer can see that categories of documents exist, but it doesn’t know what to call them. Nor do the algorithmically-generated categories necessarily correspond to the way a journalist might want to organize them. You could plausibly group incidents by date, location, type of event, actors, number of casualties, equipment involved, or many other ways.

For that reason, Overview provides a tagging interface (center) so that the user can name topics and group them in whatever way makes sense. The computer-generated categories serve as a starting point for analysis, a scaffold for the journalist’s exploration and story-specific categorization. In this image, the orange “aircraft” tag is selected, and the selected documents appear in the topic tree, the items plot, and as a list of individual documents. The first of these aircraft-related documents is visible in the document window, served up by DocumentCloud.

Random sampling
It took about 12 hours to explore the topic tree, assign tags and create a categorization that we felt suited the story. The general content of the document set was clear pretty quickly. At some point, there’s no way around a reporter reading a lot of documents, and Overview is really just a structured way to choose which documents to read. It’s a shortcut, because after you look at a few documents in a cluster and discover that they’re all more or less the same type of incident, you usually don’t really need to read the rest.

This process produces an intuitive sense of the contents of a document set. It’s key to finding the story, but it doesn’t provide any basis for making claims about how often certain types of events occurred, or whether incidents of one type really differed from incidents of another type. For example, we found that the incidents mostly involved contractors shooting at cars that got too close to diplomatic motorcades. But what does “mostly” mean? Is it a majority of the incidents? Do we need to look more closely at the other material, or does this cover 90 percent of what happened?

In principle, to answer this type of general question you’d need to read every single document, keeping a count of how many involved “aggressive vehicles,” as they are called in the reports. Dividing that count by the total number of documents gives the percentage. Reading every document is impractical, but there’s an excellent shortcut: random sampling.

Random sampling is like polling: ask a few people, and use their answers to stand in for the whole population. The randomization ensures that you don’t end up polling an unrepresentative group. For example, if all of the sample documents we choose to look at come from a pile which contains many more “aggressive vehicle” incidents than average, our percentages will obviously be skewed. For this reason, Overview includes a button that chooses a random document from among those currently selected. If you first select all documents, this is a random sample drawn from the entire set.
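The sampling procedure can be sketched as follows. This is not how Overview's random-document button is implemented; it is an illustration that draws the whole sample at once, without replacement, using Python's random module. The function names and the `has_property` callback (standing in for a reporter reading a document and deciding whether it fits a category) are hypothetical.

```python
import random


def draw_sample(doc_ids, k, seed=None):
    """Pick k distinct documents uniformly at random from doc_ids."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(list(doc_ids), k)


def estimate_proportion(sample, has_property):
    """Fraction of the sampled documents for which has_property is True."""
    return sum(1 for doc in sample if has_property(doc)) / len(sample)
```

For instance, `draw_sample(range(666), 50)` picks 50 of 666 document numbers, and the fraction of those that a reader tags as a given type of incident is the estimate for the whole set.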

We used a random sample of 50 out of the 666 documents to establish the factual basis of the following statements in our report:

  • The majority of incidents, about 65 percent, involve a contractor team assigned to protect a U.S. motorcade firing into an “aggressive” or “threatening” vehicle.
  • There is no record of followup investigations in an estimated 95 percent of the reports.
  • About 45 percent of the reports describe events happening outside of Baghdad.
  • Our analysis found that only about 2 percent of the 2007 motorcades in Iraq resulted in a shooting.

Each of these is a statement about a proportion of something, and the sampling gives us numerical estimates for each. Along with their associated sampling errors, these figures are strong evidence that the statements above are factually correct. (The relevant sampling error formula is for “proportion estimation from a random sample without replacement,” and gives a standard error of about ±5% for our sample size.)

We also used sampling to estimate the number of incidents of contractor-caused injury to Iraqis that we might not have found. During the reporting process we found 14 such incidents (1,2,3,4,5,6,7,8,9,10,11,12,13,14), but keyword search is not reliable for a variety of reasons. For example, it is based on the scanned text, which is very error-prone. Could we be missing another few dozen such incidents? We can say with high probability that the answer is no, because we independently estimated the number of such incidents using our sample, and found it to be 2% ±2% out of 666, or most likely somewhere between 0 and 26 documents, with an expected value of 13. So while we are almost certainly missing a few incidents, it’s very unlikely that we’re missing more than a handful.
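The sampling error arithmetic in the two paragraphs above can be sketched with the textbook formula for the standard error of a proportion estimated from a sample drawn without replacement, which includes a finite population correction. This is our reading of the formula being referred to, not code from the project, and `standard_error` is a hypothetical name.

```python
import math


def standard_error(p, n, N):
    """Standard error of a proportion p estimated from n of N items,
    sampled without replacement."""
    fpc = (N - n) / (N - 1)  # finite population correction
    return math.sqrt(p * (1 - p) / n * fpc)


def report(p, n=50, N=666):
    """Print the estimate and its standard error as a share and as a document count."""
    se = standard_error(p, n, N)
    print(f"p={p:.2f}  SE={se:.3f}  about {p * N:.0f} +/- {se * N:.0f} documents")
```

The standard error depends on the proportion being estimated. With 50 documents sampled from 666 it comes to roughly 2 percentage points for a proportion of 2 percent, and roughly 6 to 7 percentage points for proportions near 50 percent, so the quoted figure of about ±5% is best read as a rough value.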

Other sources
Documents never tell the whole story; they’re simply one source, ideally one source of many. For this story, we first consulted with AP reporter Lara Jakes, who has been covering events from Baghdad for many years, and has written about private security contractors in particular. She provided a crucial reality check to make sure we understood the complex environment that the documents referred to. We also looked at many other document sources, including the multitude of lengthy government reports that this issue has generated over the years.

We then set up a call with the Department of State. Undersecretary for Management Patrick Kennedy spent almost an hour on the phone with us, and his staff worked hard to answer our followup questions. In addition to useful background information, they provided us with the number of cases concerning security contractor misconduct that the State Department has referred to the Department of Justice: five. They also told us that there were 5,648 protected diplomatic motorcades in Iraq in 2007. These figures add crucial context to the incident counts we were able to pull out of the document set, and we do not believe that either has been previously reported.

Finally, we searched news archives and other sources, such as the Iraq Body Count database, to see if the incidents of Iraqi injury we found had been previously reported. Of the fourteen incidents, four appear to have been documented elsewhere. We believe this document and this news report refer to the same incident, as well as this and this, and we suspect also this is the same as record d0233 in the Iraq Body Count database, while this matches record d4900. Of course, there may be other records of these events, but after this search we suspect that many of the incidents we found were previously unreported.

Next steps
This is the first major story completed using Overview, which is still in prototype form. We learned a lot doing it, and the practical requirements of reporting this story drove the development of the software in really useful ways. The code is up on GitHub, and over the next few weeks we will be releasing training materials which we hope will allow other people to use it successfully. We will also hold a training session at the NICAR conference this week. The software itself is also being continually improved. We have a lot of work to do.

Our next step is actually a complete rewrite, to give the system a web front end and integrate it with DocumentCloud. This will make it accessible to many more people, since many journalists already use DocumentCloud and a web UI means there is nothing to download and install. We’re hiring engineers to help us do this; for details on the plan, please see our job posting.