Next steps for development, and a job posting

With our recent analysis of Iraq security contractor documents, the Overview prototype has been used for its first real story. But our prototype is just that: a proof-of-concept tool, built as quickly as possible to validate certain algorithms and approaches. The next step is to create a solid architecture for future work. We need to make this technology web-deployable, scalable, and integrated with DocumentCloud.

If you haven’t already, take a look at our writeup of how we used the Overview prototype for our Iraq security contractors work. We started with documents posted to DocumentCloud, then downloaded the original PDF files for processing with a series of Ruby scripts. After processing, we used the prototype visualization interface, written in Java, to find topics and tag documents in bulk according to their subject. We’d like to streamline this whole process, so that Overview works like this:

  • Upload raw material to DocumentCloud.
  • Select documents for exploration in Overview, by using the DocumentCloud project and search functions.
  • Launch Overview directly in the browser. Use the visualization tools to explore the set, create subject tags, and apply them to the documents.
  • Export Overview’s tags back into native DocumentCloud tags and annotations.

In short, we want to tightly integrate Overview’s semantic visualization with DocumentCloud’s storage, search, viewing, annotation, and management tools. This means that Overview has to have a web front end, which means the interface needs to be JavaScript, not Java. We also suspect that for performance reasons, the visualizations will need to be rendered in WebGL. On the back end, Ruby is just too slow for natural language processing. For example, our common bigram detection code (which helps Overview discover frequently used two-word phrases) takes several minutes to build an intermediate table with hundreds of thousands of elements for the 4,500 pages of the Iraq contractor set. We’d like the Overview architecture to scale to millions of pages, a thousand times larger, which would take days with the current algorithm. So the server-side processing needs to be implemented in a higher-performance language, such as Java.
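To make the bigram step concrete, here is a minimal sketch of common-bigram detection in Python. It is not the prototype's Ruby code: the tokenizer, the `min_count` threshold, and the function name `common_bigrams` are all assumptions made for illustration.

```python
import re
from collections import Counter

def common_bigrams(texts, min_count=2):
    """Count adjacent word pairs across all texts and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters only.
        words = re.findall(r"[a-z]+", text.lower())
        # Pair each word with its successor to form bigrams.
        counts.update(zip(words, words[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

pages = [
    "The PSD team fired warning shots at the vehicle.",
    "Warning shots were fired by the PSD team.",
]
frequent = common_bigrams(pages)
```

The intermediate table here is the `counts` dictionary, which grows with the number of distinct word pairs in the set. That growth is why this step becomes expensive on large collections.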

The good news is that Overview uses one of the same basic data structures as search engines, a TF-IDF weighted index. DocumentCloud uses the popular Solr search platform, so integration with DocumentCloud will also pave the way for integration with any application which is based on Solr. That’s a lot of possible applications.
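For readers unfamiliar with TF-IDF, the following is a minimal sketch of one common weighting variant in Python. The exact formula used by Overview or Solr is not stated in this post, so the term-frequency normalization and the logarithmic IDF below are assumptions, as is the function name `tf_idf`.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Each document is a non-empty list of tokens.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            # Assumed variant: relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weighted

docs = [
    ["motorcade", "fired", "vehicle"],
    ["motorcade", "helicopter", "crash"],
]
weights = tf_idf(docs)
```

In this toy example "motorcade" appears in every document, so its weight is zero, while words unique to one document get a positive weight. That is the property that makes the index useful both for search and for grouping documents by topic.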

Given all of the above, this is the current planned order of development tasks, which we think could be accomplished in about a year by a competent engineer: rewrite the prototype with a Java back end and JavaScript/WebGL UI; integrate the user experience with DocumentCloud’s tagging system; then integrate the back end with the Solr index data structures and APIs. As we go, we’ll collect feedback from our growing tester and user community and decide what to build next; there is a wide range of problems we could address.

We’re hiring two engineers on a full-time basis to accomplish this, perhaps one person who’s more inclined to the user interface, and one who is more inclined to the back end processing. We’re looking for:

  • Solid Java or JavaScript engineering experience, preferably 3-5 years of work on large applications.
  • Familiarity with open source development projects.
  • Experience in computer graphics, visualization, natural language processing, or distributed systems a plus.

This is a contract position. We’d prefer if you worked with us out of the AP offices in New York, but we’ll consider remote contributors. Please contact if interested.

Using Overview to analyze 4500 pages of documents on security contractors in Iraq

This post describes how we used a prototype of the Overview software to explore 4,500 pages of incident reports concerning the actions of private security contractors working for the U.S. State Department during the Iraq war. This was the core of the reporting work for our previous post, where we reported the results of that analysis.

The promise of a document set like this is that it will give us some idea of the broader picture, beyond the handful of really egregious incidents that have made headlines. To do this, in some way we have to take into account most or all of the documents, not just the small number that might match a particular keyword search.  But at one page per minute, eight hours per day, it would take about 10 days for one person to read all of these documents — to say nothing of taking notes or doing any sort of followup. This is exactly the sort of problem that Overview would like to solve.

The reporting was a multi-stage process:

  • Splitting the massive PDFs into individual documents and extracting the text
  • Exploration and subject tagging with the Overview prototype
  • Random sampling to estimate the frequency of certain types of events
  • Followup and comparison with other sources

Splitting the PDFs
We began with documents posted to DocumentCloud — 4,500 pages worth of declassified, redacted incident reports and supporting investigation records from the Bureau of Diplomatic Security. The raw material is in six huge PDF files, each covering a six-month range, and nearly a thousand pages long.

Overview visualizes the content of a set of  “documents,” but there are hundreds of separate incident reports, emails, investigation summaries, and so on inside each of these large files. This problem of splitting an endless stack of paper into sensible pieces for analysis is a very common challenge in document set work, and there aren’t yet good tools. We tackled the problem using a set of custom scripts, but we believe many of the techniques will generalize to other cases.

The first step is extracting the text from each page. DocumentCloud already does text recognition (OCR) on every document uploaded, and the PDF files it gives you to download have the text embedded in them. We used DocumentCloud’s convenient docsplit utility to pull out the text of each page into a separate file, like so:

docsplit text --pages all -o textpages january-june-2005.pdf

This produces a series of files named january-june-2005_1.txt, january-june-2005_2.txt, etc. inside the textpages directory. This recovered text is a mess, because these documents are just about the worst possible case for OCR: many of them are forms with a complex layout, and the pages have been photocopied multiple times, redacted, scribbled on, stamped and smudged. But large blocks of text come through pretty well, and this command extracts what text there is into one file per page.

The next step is combining pages into their original multi-page documents. We don’t yet have a general solution, but we were able to get good results with a small script that detects cover pages, and splits off a new document whenever it finds one. For example, many of the reports begin with a summary page that looks like this:

Our script detects this cover page by looking for “SENSITIVE BUT UNCLASSIFIED,” “BUREAU OF DIPLOMATIC SECURITY” and “Spot Report” on three different lines. Unfortunately, OCR errors mean that we can’t just use the normal string search operations, as we tend to get strings like “SENSITIZV BUT UNCLASSIEIED” and “BUR JUDF DIPLOJ>>TIC XECDRITY.” Also, these are reports typed by humans and don’t have a completely uniform format. The “Spot Report” line in particular occasionally says something completely different. So, we search for each string with a fuzzy matching algorithm, and require only two out of these three strings to match.
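The two-out-of-three fuzzy test described above can be sketched as follows. This is an illustration in Python rather than our actual script: it uses the standard library's `difflib.SequenceMatcher` as a stand-in fuzzy matcher, and the 0.6 similarity threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher

# Marker strings quoted in the text above, uppercased for comparison.
MARKERS = [
    "SENSITIVE BUT UNCLASSIFIED",
    "BUREAU OF DIPLOMATIC SECURITY",
    "SPOT REPORT",
]

def best_line_score(marker, lines):
    """Highest similarity (0..1) between the marker and any single line."""
    return max(
        (SequenceMatcher(None, marker, line).ratio() for line in lines),
        default=0.0,
    )

def is_cover_page(page_text, threshold=0.6, required=2):
    """True if at least `required` markers fuzzily match some line of the page."""
    lines = [ln.strip().upper() for ln in page_text.splitlines() if ln.strip()]
    hits = sum(best_line_score(marker, lines) >= threshold for marker in MARKERS)
    return hits >= required

ocr_page = "SENSITIZV BUT UNCLASSIEIED\nBUR JUDF DIPLOJ>>TIC XECDRITY\nIncident Summary"
found_cover = is_cover_page(ocr_page)
```

This simplified version does not check that the markers fall on different lines, and a real set of thresholds would need tuning per cover page type.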

We found about 10 types of cover pages in the document set, each of which required a different set of strings and matching thresholds. But with this technique, we were able to automatically divide the pages into 666 distinct documents, most of which contain material concerning a single incident. It’s not perfect — sometimes cover pages are not detected correctly, or are entirely missing — but it’s good enough for our purposes.

The pre-processing script writes the concatenated text for each extracted document into one big CSV file, one document per row. It also writes out the number of pages for that document, and a document URL formed by adding the page number to the end of a DocumentCloud URL. If you can get your document set into this sort of CSV input format, you can explore it with the Overview prototype.
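As a rough sketch of this step, the Python below groups pages into documents at each detected cover page and writes one CSV row per document. The column names, the example base URL, and the `#p<page>` URL suffix are placeholders chosen for illustration; they are not the format the prototype actually reads.

```python
import csv
import io

def group_pages(pages, is_cover):
    """Start a new document at every page the predicate flags as a cover."""
    docs = []
    for number, text in enumerate(pages, start=1):
        # The very first page always opens a document, cover or not.
        if is_cover(text) or not docs:
            docs.append({"first_page": number, "pages": []})
        docs[-1]["pages"].append(text)
    return docs

def write_csv(docs, base_url, out):
    """Write concatenated text, page count, and a per-document URL."""
    writer = csv.writer(out)
    writer.writerow(["text", "pages", "url"])  # assumed column names
    for doc in docs:
        writer.writerow([
            " ".join(doc["pages"]),
            len(doc["pages"]),
            f"{base_url}#p{doc['first_page']}",  # assumed URL suffix format
        ])

pages = ["SPOT REPORT one", "details", "SPOT REPORT two"]
docs = group_pages(pages, lambda text: text.startswith("SPOT REPORT"))
buffer = io.StringIO()
write_csv(docs, "https://example.org/doc.html", buffer)
```

In practice the `is_cover` predicate would be the fuzzy cover page detector rather than a simple prefix test.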

Exploring the documents with Overview

The Overview prototype comes in two parts: a set of Ruby scripts that do the natural language processing, and a document set exploration GUI that runs as a desktop Java app. Starting from iraq-contractor-incidents.csv, we run the preprocessing and launch the app like this:

./ iraq-contractor-incidents

./ iraq-contractor-incidents

Overview has advanced quite a bit since the proof-of-concept visualization work last year, and we now have a prototype tool set with a document set exploration GUI that looks like this:

Top right is the “items plot,” which is an expanded version of the prototype “topic maps” that we demonstrated in our earlier work visualizing the War Logs. Each document is a dot, and similar documents cluster together. The positions of the dots are abstract and don’t correspond to geography or time. Rather, the computer tries to put documents about similar topics close together, producing clusters. It determines “topic” by analyzing which words appear in the text, and how often.
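One simple way to turn "which words appear, and how often" into a similarity score is cosine similarity between word-count vectors. The sketch below is an assumption about the general approach, not a description of the metric the prototype uses, and it works on raw counts rather than weighted ones.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Similarity in [0, 1] based on shared words and their counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("team fired at vehicle", "team fired at vehicle")
different = cosine_similarity("team fired at vehicle", "helicopter crash")
```

Documents with a high score would be drawn close together, and documents with no words in common would score zero.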

Top left is the “topic tree”, our new visualization of the same documents. It’s based on the same similarity metric as the Items Plot, but here the documents are divided into clusters and sub-clusters.

The computer can see that categories of documents exist, but it doesn’t know what to call them. Nor do the algorithmically-generated categories necessarily correspond to the way a journalist might want to organize them. You could plausibly group incidents by date, location, type of event, actors, number of casualties, equipment involved, or many other ways.

For that reason, Overview provides a tagging interface (center) so that the user can name topics and group them in whatever way makes sense. The computer-generated categories serve as a starting point for analysis, a scaffold for the journalist’s exploration and story-specific categorization. In this image, the orange “aircraft” tag is selected, and the selected documents appear in the topic tree, the items plot, and as a list of individual documents. The first of these aircraft-related documents is visible in the document window, served up by DocumentCloud.

Random sampling
It took about 12 hours to explore the topic tree, assign tags and create a categorization that we felt suited the story. The general content of the document set was clear pretty quickly. At some point, there’s no way around a reporter reading a lot of documents, and Overview is really just a structured way to choose which documents to read. It’s a shortcut, because after you look at a few documents in a cluster and discover that they’re all more or less the same type of incident, you usually don’t really need to read the rest.

This process produces an intuitive sense of the contents of a document set. It’s key to finding the story, but it doesn’t provide any basis for making claims about how often certain types of events occurred, or whether incidents of one type really differed from incidents of another type. For example, we found that the incidents mostly involved contractors shooting at cars that got too close to diplomatic motorcades. But what does “mostly” mean? Is it a majority of the incidents? Do we need to look more closely at the other material, or does this cover 90 percent of what happened?

In principle, to answer this type of general question you’d need to read every single document, keeping a count of how many involved “aggressive vehicles,” as they are called in the reports. Dividing that count by the total number of documents gives the percentage. Reading every document is impractical, but there’s an excellent shortcut: random sampling.

Random sampling is like polling: ask a few people, and substitute their results for the whole population. The randomization ensures that you don’t end up polling an unrepresentative group. For example, if all of the sample documents we choose to look at come from a pile which contains many more “aggressive vehicle” incidents than average, obviously our percentages will be skewed. For this reason, Overview includes a button that chooses a random document from among those currently selected. If you first select all documents, this is a random sample drawn from the entire set.
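The sampling itself is easy to reproduce outside Overview. The following Python sketch draws a sample without replacement from a list of document IDs. Overview's button picks one document at a time, so drawing the whole sample at once, the function name `draw_sample`, and the seed value are assumptions made for illustration.

```python
import random

def draw_sample(doc_ids, k, seed=None):
    """Pick k distinct documents uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(doc_ids, k)

all_docs = list(range(1, 667))  # IDs 1..666, one per extracted document
sample = draw_sample(all_docs, 50, seed=2012)
```

Because every document has the same chance of being picked, no cluster or pile can dominate the sample except by chance.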

We used a random sample of 50 out of the 666 documents to establish the factual basis of the following statements in our report:

  • The majority of incidents, about 65 percent, involve a contractor team assigned to protect a U.S. motorcade firing into an “aggressive” or “threatening” vehicle.
  • There is no record of followup investigations in an estimated 95 percent of the reports.
  • About 45 percent of the reports describe events happening outside of Baghdad.
  • Our analysis found that only about 2 percent of the 2007 motorcades in Iraq resulted in a shooting.

Each of these is a statement about a proportion of something, and the sampling gives us numerical estimates for each. Along with their associated sampling errors, these figures are strong evidence that the statements above are factually correct. (The relevant sampling error formula is for “proportion estimation from a random sample without replacement,” and gives a standard error of about ±5% for our sample size.)
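For readers who want to check the arithmetic, here is a sketch of one common form of that formula in Python: the usual standard error of a proportion multiplied by a finite population correction. Textbooks differ slightly in the exact form (for example, n versus n - 1 in the denominator), so treat this as an approximation rather than the exact calculation we ran.

```python
import math

def proportion_standard_error(p, n, population):
    """Standard error of a sample proportion, sampling without replacement."""
    # Finite population correction: shrinks the error when the sample
    # is a noticeable fraction of the whole population.
    fpc = (population - n) / (population - 1)
    return math.sqrt(p * (1 - p) / n * fpc)

se_majority = proportion_standard_error(0.65, 50, 666)  # roughly 0.065
se_rare = proportion_standard_error(0.02, 50, 666)      # roughly 0.019
```

The error depends on the proportion being estimated: it is largest near 50 percent and shrinks toward the extremes, so a single figure for the whole sample is only a rough guide.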

We also used sampling to estimate the number of incidents of contractor-caused injury to Iraqis that we might not have found. During the reporting process we found 14 such incidents (1,2,3,4,5,6,7,8,9,10,11,12,13,14) but keyword search is not reliable for a variety of reasons. For example it is based on the scanned text, which is very error-prone. Could we be missing another few dozen such incidents? We can say with high probability that the answer is no, because we independently estimated the number of such incidents using our sample, and found it to be 2% ±2% out of 666, or most likely somewhere between 0 and 26 documents, with an expected value of 13. So while we are almost certainly missing a few incidents, it’s very unlikely that we’re missing more than a handful.
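The conversion from a percentage estimate to a document count is simple arithmetic, sketched below in Python. The function name `count_range` is hypothetical, and rounding down is an assumption that happens to reproduce the figures quoted above.

```python
import math

def count_range(p, margin, population):
    """Scale a proportion estimate and its margin to whole-document counts."""
    low = max(0.0, p - margin) * population
    high = min(1.0, p + margin) * population
    # Round down to whole documents (an assumed convention).
    return math.floor(low), math.floor(p * population), math.floor(high)

low, expected, high = count_range(0.02, 0.02, 666)
```

With a 2 percent estimate and a 2 percent margin on 666 documents, this gives a range of 0 to 26 with an expected value of 13.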

Other sources
Documents never tell the whole story; they’re simply one source, ideally one source of many. For this story, we first consulted with AP reporter Lara Jakes, who has been covering events from Baghdad for many years, and has written about private security contractors in particular. She provided a crucial reality check to make sure we understood the complex environment that the documents referred to. We also looked at many other document sources, including the multitude of lengthy government reports that this issue has generated over the years.

We then set up a call with the Department of State. Undersecretary for Management Patrick Kennedy spent almost an hour on the phone with us, and his staff worked hard to answer our followup questions. In addition to useful background information, they provided us with the number of cases concerning security contractor misconduct that the State Department has referred to the Department of Justice: five. They also told us that there were 5,648 protected diplomatic motorcades in Iraq in 2007. These figures add crucial context to the incident counts we were able to pull out of the document set, and we do not believe that either has been previously reported.

Finally, we searched news archives and other sources, such as the Iraq Body Count database, to see if the incidents of Iraqi injury we found had been previously reported. Of the fourteen incidents, four appear to have been documented elsewhere. We believe this document and this news report refer to the same incident, as well as this and this, and we suspect also this is the same as record d0233 in the Iraq Body Count database, while this matches record d4900. Of course, there may be other records of these events, but after this search we suspect that many of the incidents we found were previously unreported.

Next steps
This is the first major story completed using Overview, which is still in prototype form. We learned a lot doing it, and the practical requirements of reporting this story drove the development of the software in really useful ways. The code is up on GitHub, and over the next few weeks we will be releasing training materials which we hope will allow other people to use it successfully. We will also hold a training session at the NICAR conference this week. The software itself is also being continually improved. We have a lot of work to do.

Our next step is actually a complete rewrite, to give the system a web front end and integrate it with DocumentCloud. This will make it accessible to many more people, since many journalists already use DocumentCloud and a web UI means there is nothing to download and install. We’re hiring engineers to help us do this; for details on the plan, please see our job posting.

What did private security contractors do in Iraq?

The U.S. employed more private contractors in Iraq than in any previous war, at times exceeding the number of regular military personnel, and roughly 10% of them were in armed roles by the end of the war. A few high-profile incidents made headlines, such as the Blackwater shootings at Nisoor Square in September 2007, but there hasn’t yet been a comprehensive public record of these private security contractors’ actions at the height of the war. Thousands of pages of recently released material change that, and provide an ideal test case for Overview’s evolving document mining capabilities.

The documents show that mostly, these contractors fired at approaching civilian vehicles to protect U.S. motorcades from the threat of suicide bombers. The documents also show how often shots were fired, and provide a window into how State Department oversight of security contractors tightened during the war.

The documents come from a Freedom of Information request filed with the U.S. Department of State by journalist John Cook in November 2008. Cook received the paperwork in batches over the last 18 months, and posted the 4,500 pages of incident reports and supporting investigation records from the Bureau of Diplomatic Security on DocumentCloud.

The record only covers the work of State Department contractors between 2005 and 2007; the majority of U.S. contractors worked for the Department of Defense, according to a 2008 Government Accountability Office report. The State Department also has excluded some documents relating to ongoing criminal investigations or national security. Nonetheless, this is the most exhaustive record we have, and offers us the possibility of moving beyond anecdotes to broader patterns.

In addition to the document analysis, we spoke with Undersecretary for Management Patrick Kennedy, who oversees the State Department’s Bureau of Diplomatic Security. That conversation provided context for these events. His assistant, Christina Maier, answered many of our specific questions.

For details on how we used the Overview prototype to report on these documents, including the exact methodology, see this post.

What did private security contractors do?

The documents cover about 600 incidents that involved security contractors firing a weapon in Iraq. It’s not clear exactly how the department decided whether a report was warranted. Some reports are many pages long, including witness testimony and extended investigative reports. In other cases, only a terse cover page exists. The documents mostly concern the actions of the three private contractors then working for the State Department: Blackwater, DynCorp, and Triple Canopy. A handful of incidents involve KBR, another contractor, and the U.S. Marines.

The majority of incidents, about 65 percent, involve a contractor team assigned to protect a U.S. motorcade firing into an “aggressive” or “threatening” vehicle.

A typical example, involving a detail protecting workers for the U.S. Agency for International Development in Baghdad, reads:

At approximately 0950, 11 May 05, a USAID PSD [private security detail] Team fired four rounds into the hood of a dark colored BMW taxi after the driver of the vehicle moved around a line of traffic, failed to yield to verbal and hand signals and approached the PSD vehicles while the detail was slowing for congested traffic. Upon receiving fire, the BMW slowed its approach and rolled to a stop against a bus parked on the right side of the road. The PSD exited the area and continued with their mission without further incident. There were three USAID principals onboard at the time of the incident. No friendly personnel were affected. Status of driver and hostile vehicle is unknown at this time.

The bulk of the documents report hundreds of such incidents with minor variations. The report always includes at least a brief mention of the ways that the contractors tried to stop the vehicle before shooting. Sometimes, “verbal commands” or “visual signals” are mentioned. In other cases the contractors tried flashing lights, threw water bottles, or fired flares or smoke grenades before firing.

Motorcade guards shot vehicles that approached too closely because of the threat of vehicle suicide bombers, known as “vehicle-borne improvised explosive devices” or “VBIEDs.” It’s not clear how many of the vehicles were actually a threat; there is no record of followup investigations in an estimated 95 percent of the reports. There are few details about what happened to the drivers of the vehicles that were shot at; sometimes, a report states that the driver “did not appear to be injured.” In other instances, there is no comment at all.

Most reports describe a few rounds fired into the front of the vehicle that succeeded in stopping the car. On other occasions, gunners fired into the car windows if shots to the front grille didn’t stop the car. We found a number of incidents where, after nonviolent warnings, contractors fired into windows first (1,2,3,4,5). On two of these occasions, gunners said that they “didn’t have time to shoot to disable,” which was acceptable under the policies then in force.

Some of the drivers didn’t stop their cars, and just kept going after taking bullets. One taxi took four rounds and “continued to push past the motorcade.”

We found 10 recorded Iraqi deaths, and a smaller number of injuries. In one case a bullet went through the windshield and hit the driver’s right shoulder. The team “provided first aid and turned the man over to a local national who stated that he was a doctor.” In another case, an ambulance was called and the team waited, but the driver eventually refused help and left the scene. But in general these contractors do not seem to have been equipped to deliver medical aid. After one fatal shooting, the investigator who interviewed the team noted, “Vehicle was engaged due to possible VBIED; there is no standard operating procedure for PSD teams to search vehicles render aide to [sic] in such an incident.”

The documents show that shots were also fired as the result of misunderstandings. After Marines fired on a car trying to enter the U.S. Embassy Annex through the exit lane, investigators concluded that “the local national had no apparent hostile intention and his actions were based on his misunderstanding of the new security procedures.” On another occasion a Marine fired at a vehicle driven by “a U.S. citizen employed by the U.S. Army Corps of Engineers” who “was talking on his cellular telephone and didn’t follow the Marines’ directions.” In another incident, a DynCorp team shot at an Iraqi judge after he failed to stop his car, hitting him in the leg.

The bulk of the documents concern this type of “escalation of force” against a vehicle, but a smaller number of documents report contractor responses to attacks on U.S. personnel. A motorcade was fired upon by fighters on the roof of an abandoned five-story building. An attack on the “Municipalities and Public Works Annex building” ultimately killed five U.S. personnel in a helicopter crash, and was later cited by the State Department as an example of heroic behavior by a contractor. There was an attack on Baghdad’s city hall, and another at a Doura power plant. There are also several instances of Blackwater aircraft brought down by small arms fire (1,2,3).

About 45 percent of the reports describe events happening outside of Baghdad. In the provincial capital of Basra, the palace compound was repeatedly attacked by rockets. In what was described as a “suicide probe,” a man carrying a “white bag” approached the gates of the U.S. Embassy in Basra and would not stop after warnings and a flash grenade. Guards shot him. There are also a half-dozen reports of suspicious boats approaching the Embassy building from the riverside, and in one case a Triple Canopy contractor fired upon a boat after it ignored flares and warning shots.

Finally, there are a handful of reports of contractors shooting aggressive stray dogs. In one instance a Blackwater contractor killed a dog that belonged to the New York Times’ Baghdad bureau, after it fought with the contractor’s bomb-sniffing dog.

Tightening oversight

The documents show that the shootings led to greater oversight as the war progressed. In February 2005, Blackwater guards fired over 100 rounds at a car approaching their motorcade on the other side of a median, hitting the driver. The contractors initially maintained that the car’s passenger had fired into their vehicle, but investigators later found that the Blackwater guards had fired those shots themselves. The contractors also claimed that the car was on a pre-existing list of suspicious vehicles, known as the “be on the lookout” list.

Yet one of the guards later told investigators that claiming that the vehicle was on this list was “simply standard practice when reporting a shooting incident, per Blackwater management.”

The investigator’s report says that “several of the PSD individuals involved in the shooting provided false statements to the investigators,” but the head of diplomatic security in Baghdad, John Frese, decided not to discipline the contractors because it “would be deemed as lowering the morale of the entire PSD entity.”

The State Department declined a request to comment on this incident.

The investigator’s February 2005 report recommended several policy changes, including posting signs on motorcade vehicles stating “stay back 100 meters” in English and Arabic, counting the number of rounds fired after every shooting incident, and “establishing a clear and unambiguous policy regarding appropriate use of warning/disabling shots at vehicles.”

The documents include a State Department security contractor policy manual dated August 2005 with such guidelines. The manual said that shooting at approaching vehicles is authorized “if it constitutes the appropriate level of force to mitigate the threat.” Shots can be fired into a car “to prohibit a threat from entering into an area where the protective detail would be exposed to an attack,” the manual says. It also advises contractors to issue visible and verbal warnings before firing.

This policy also requires an internal investigation and written reports from all shooters and witnesses any time a firearm is discharged.

Were problems common?

Out of about 600 incidents in total, the AP found 14 incidents where an Iraqi was injured by contractor gunfire, including 8 deaths (1,2,3,4,5,6,7,8,9,10,11,12,13,14).

The State Department told us that there were 5,648 protected diplomatic motorcades in Iraq in 2007. Our analysis found that only about 2 percent of the 2007 motorcades in Iraq resulted in a shooting. This agrees closely with previous estimates that between 1 percent and 3 percent of the motorcades involved shootings, according to congressional testimony.

Out of all the cases where contractors used force, the State Department told us that a total of five cases have been referred to the Department of Justice for possible prosecution.

Prosecution doomed from the start

On September 17, 2007, guards working for Blackwater Worldwide shot and killed 17 Iraqi civilians in Nisoor Square, Baghdad. The incident received international media attention and spawned a congressional hearing. But the criminal case against five former Blackwater contractors was dropped after a judge ruled that government prosecutors improperly relied on statements that the State Department compelled the contractors to make.

The documents analyzed by the AP provide an important clue as to how this might have happened. There is a frequently used “sworn statement” form for contractors (like this example) which states “I further understand that neither my statements nor any information or evidence gained by reason of my statements can be used against me in a criminal proceeding.” Such statements, mandatory whenever shots were fired, suggest that contractors were effectively granted automatic immunity immediately after any incident.

Even if that were not the case, it’s not clear what laws would cover alleged crimes.  Security contractors in Iraq were immune from Iraqi law until the end of 2008, while current U.S. laws may not cover the acts of overseas armed contractors not directly involved in a Department of Defense mission.

Blackwater, now known as Academi, settled a civil suit with the families of several of the Nisoor Square victims in January.

After Nisoor Square

The documents reviewed by the AP do not include the Nisoor Square shootings, which triggered major changes in contractor oversight. An expert panel convened by then-Secretary of State Condoleezza Rice recommended 18 specific policy changes. According to a subsequent GAO report, the State Department implemented most of the changes, including placing at least one government security agent in each motorcade, installing video cameras in all vehicles, and recording both radio transmissions and satellite-tracked vehicle locations.

According to the same report, the number of weapons discharges by security contractors working for both the Department of Defense and the Department of State decreased by 60 percent after the changes went into effect. Military and civilian casualties also fell greatly during the same time period, making it difficult to know if new policies resulted in fewer shootings.

Undersecretary Kennedy was on the expert panel that made those policy recommendations. He noted that no official in a department-escorted convoy has ever been killed in Iraq. (There have been deaths from other causes, such as mortar attacks.)

“We try not to be draconian about it,” he said. “Could we have done the same with less use of force? I don’t know how you could validate retrospectively that the escalation wasn’t appropriate.”

Either way, the State Department will continue to use security contractors in Iraq and worldwide. Kennedy said the number of security contractors working for the department in Iraq has increased since U.S.  troops left the country because the department now has additional security responsibilities, including the protection of six Iraqi military training sites.

“There are only about 1700 State Dept. special agents in the world,” he said. “We have 280 embassies. There is no way I can take 1700 special agents and about 100 officers and stretch them to do my mission without contractors.”


3 Difficult Document-Mining Problems that Overview Wants to Solve

The Overview project is an attempt to create a general-purpose document set exploration system for journalists. But that’s a pretty vague description. To focus the project, it’s important to have a set of test cases — real-world problems that we can use to evaluate our developing system.

In many ways, the test cases define the problem. They give us concrete goals, and a way to understand how well or poorly we are achieving those goals. These tests should be diverse enough to be representative of the problems that journalists face when reporting on document sets, and challenging enough to push us to innovate. There’s also value in using material that is already well-studied, so we can compare the results using Overview to what we’ve already learned using other techniques.

With that in mind, we’ve been scouring the document set lore and the AP’s own archives to find good test data. Here are three types of problems we’d like Overview to address, and some document sets that provide good examples of each.

A large set of structured documents — the Wikileaks files
Wikileaks published the Afghanistan and Iraq war logs data sets last year, and recently the full archive of U.S. diplomatic cables has also become available. All three archives are the same basic type: hundreds of thousands of documents in identical format.

Each document has the same set of pre-defined fields, such as date, location, incident type, originating embassy, etc. But this isn’t just a series of fill-in-the-blank forms, because each document also includes a main text field that is written in plain English (well, English with a lot of jargon). We call these types of documents “semi-structured,” and part of the analysis work here is understanding the relationship between the free-form text and the structured fields.

For example, our previous visualizations of the war logs use the topics discussed in the text to cluster the dots that represent each document, but the color is from the “incident type” field: red for “explosive hazard,” light blue for “enemy action,” dark blue for “criminal event,” and so on. The human eye can interpret color and shapes at the same time, so this allows us to literally see the relationship between topics and incident types.


There are lots of other large, homogeneous, semi-structured document sets of interest to journalists. Corporate filings are a prime example, but we might also want to analyze legislative records (as the AP did to learn how “9/11” was invoked in the U.S. Congress over the last 10 years), or the police reports of a particular city.

The key feature of this type of document set is that all the documents are the same type, in the same format, and there are a lot of them. The Wikileaks war logs and cables are a good specific test because they are widely available and already well-studied, so we can see whether Overview helps us see stories that we already know are there.

Communications records — the Enron emails
Federal investigators released a large set of internal emails after the spectacular collapse of the Enron corporation in 2001. The Enron corpus contains more than 600,000 emails written by 158 different people within the company. It has been widely used both to study this specific case of corporate wrongdoing and to explore broader principles and techniques in social network analysis.

The simplest way to visualize a huge pile of emails is to plot each email address as a node and draw edges when one person emailed another. That produces a plot of the social network of communicators, such as this one from Stanford University assistant professor Jeffrey Heer’s Exploring Enron project:


But there are other ways to understand this data set. For example, this plot excludes the element of time. Perhaps a group of conspirators gradually stopped talking to outsiders, or maybe power shifted from one branch of the company to another over time. These sorts of questions are addressed by dynamic network analysis. You could also ignore the social network completely and try to plot the threads of conversation, where one message refers back to an earlier one by someone else, as IBM’s thread arc project did.

Email dumps are increasingly common, especially with the recent uptick of hacking by collectives such as Anonymous and Lulzsec. But the concepts and tools used to analyze email can be applied to a broader category: any record of communications between a set of people. These could be emails, IM transcripts, Facebook messages, or a large set of Twitter traffic. To be useful for this type of analysis each record must contain at least the date, the sender, the recipient(s), and the message itself. There might also be things like subject lines or references to previous messages, which can be very useful in tracing the evolution of a conversation.
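To make the minimum record format concrete, here is a small sketch in Python of how such records could be turned into the edges of a social network, and how the same edges could be bucketed by month to bring time back into the picture. The `Message` class, its field names, and the sample addresses are all invented for illustration; none of this is code from Overview or from any of the projects mentioned above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Message:
    """One communication record; the field names are illustrative assumptions."""
    sent: date
    sender: str
    recipients: tuple[str, ...]
    body: str


def edge_counts(messages):
    """Count messages per (sender, recipient) pair: the weighted edges of the network."""
    counts = Counter()
    for msg in messages:
        for recipient in msg.recipients:
            counts[(msg.sender, recipient)] += 1
    return counts


def edge_counts_by_month(messages):
    """Same edges, grouped by (year, month), so shifts over time become visible."""
    buckets = {}
    for msg in messages:
        buckets.setdefault((msg.sent.year, msg.sent.month), []).append(msg)
    return {month: edge_counts(group) for month, group in buckets.items()}


# Made-up sample records, not taken from any real corpus.
sample = [
    Message(date(2001, 1, 5), "alice@example.com", ("bob@example.com",), "Budget draft"),
    Message(date(2001, 1, 9), "alice@example.com",
            ("bob@example.com", "carol@example.com"), "Re: Budget draft"),
    Message(date(2001, 2, 2), "bob@example.com", ("alice@example.com",), "Follow-up"),
]

for (sender, recipient), n in sorted(edge_counts(sample).items()):
    print(f"{sender} -> {recipient}: {n}")
```

Because the functions only rely on date, sender, and recipients, the same sketch would apply to IM transcripts or Twitter traffic once they are mapped into that record shape.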

Messy document dumps — the BP oil spill records
Freedom of Information laws don’t require governments to organize the documents they give back. In August of last year, the AP asked several U.S. federal agencies for all documents relating to the production of the report “BP Deepwater Horizon Oil Budget: What Happened to the Oil?” And we got them, in a 7,000-page PDF file. There are early drafts of the report, meeting minutes, email threads, internal reports, spreadsheets … The first step in mass analysis of this material is simply sorting it into categories.

Document classification algorithms can be used to automate this process, by scanning the text of each page and determining if it’s an email, a spreadsheet, or some other type of document. Then we can proceed with specialized visualization of each of these types of documents. For example, we could visualize the social network of the extracted emails.
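As a rough illustration of what page-level sorting might look like, the sketch below labels each page with a couple of hand-written rules. This is a deliberately crude stand-in: a real document classifier would more likely be trained on labeled examples, and the rules, thresholds, and sample pages here are assumptions made up for the example rather than anything we ran on the BP documents.

```python
import re

# Lines that look like email headers ("From:", "Subject:", ...).
EMAIL_HEADER = re.compile(r"^(from|to|cc|sent|subject):", re.IGNORECASE | re.MULTILINE)
# Tokens made only of digits and number punctuation, with at least one digit.
NUMBER = re.compile(r"^(?=.*\d)[\d,.$%()-]+$")


def classify_page(text):
    """Return a rough label for one page of extracted text."""
    # Three or more header-like lines strongly suggest an email.
    if len(EMAIL_HEADER.findall(text)) >= 3:
        return "email"
    # A page where most tokens are numeric is probably a spreadsheet.
    tokens = text.split()
    if tokens:
        numeric = sum(1 for token in tokens if NUMBER.match(token))
        if numeric / len(tokens) > 0.5:
            return "spreadsheet"
    return "other"


def sort_pages(pages):
    """Group 1-based page numbers into piles keyed by label."""
    piles = {}
    for number, text in enumerate(pages, start=1):
        piles.setdefault(classify_page(text), []).append(number)
    return piles


# Invented sample pages for demonstration only.
pages = [
    "From: a@example.com\nTo: b@example.com\nSubject: draft\n\nPlease review.",
    "Week Barrels\n1 4,200\n2 3,900\n3 4,100",
    "Minutes of the interagency meeting held on the report.",
]
print(sort_pages(pages))
```

The output is exactly the "piles" described above: a mapping from document type to the pages that belong to it, ready to hand to a type-specific visualization.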

This sorting process isn’t itself a visualization, because the output is several different piles of sorted documents, not a picture. But it’s an extremely important task, because a huge part of the work in any data journalism project is just getting everything in the right format and ready for the real analysis. Although Overview is designed for visualization, it needs to include powerful tools for data preparation and cleanup.

The Wikileaks and Enron test cases involve a large collection of identically formatted documents. The BP oil spill documents are different, because they’re anything but homogeneous. This is an important test case because it represents a problem that comes up often in journalism, especially when we want to understand what we got back from a big Freedom of Information request.

Anything else?
If Overview could help with just these three problems, it would be an extremely valuable tool for journalists. But we need to make sure they’re the right problems. Are you trying to report on a large set of documents that isn’t anything like these cases? Please let us know!

A visualization sketching system

Over the last year, my colleagues and I at The Associated Press have been exploring visualizations of very large collections of documents. We’re trying to solve a pressing problem: We have far more text than hours to read it. Sometimes a single Freedom of Information request will produce a thousand pages, to say nothing of the increasingly common WikiLeaks-sized dumps of hundreds of thousands of documents, or huge databases of public documents.

Because reading every word is impossible, a large data set is only as good as the tools we use to access it. Search can help us find what we’re looking for, but only if we know what we are looking for. Instead, we’ve been trying to make “maps” of large data sets, visualizations of the topics or locations or the interconnections between people, dates, and places. We’ve had a few notable successes, such as our visualization of the Iraq war logs.

But frankly, this has been a slow process, because the tools for large-scale text analysis are terrible. Existing programs break when faced with more than a few thousand documents. More powerful software exists, but only in component form. It requires lots of programming to get a useful result.

Meanwhile, DIY visualization thrives. At the Eyeo festival in Minneapolis this summer, I was overwhelmed by the vibrant community that has formed around data visualization. Several hundred people sat in a room and listened raptly to talks by data artist Jer Thorp, social justice visualizer Laura Kurgan, the measurement-obsessed Nick Felton, and many others. Suddenly, a great many people are enthusiastically making images from code and data.

The weapon of choice for this community is Processing, a language designed specifically for interactive graphics by Ben Fry and Casey Reas (both of whom were at Eyeo). Creative communities thrive on good tools; think of Instagram, Instructables, or Wikipedia.

We want Overview to be the creative tool for people who want to explore text visualization — “investigative journalists and other curious people,” as our grant application put it.

The algorithms that our prototypes use are old by tech standards, dating mostly from information retrieval research in the ’80s. But then, the algorithms that the resurgent visualization community is implementing in Processing are mostly old, too; I coded many of them in C++ in the early 1990s when I was learning computer graphics programming. Today, one doesn’t have to learn C++ to make pictures with algorithms. The Processing programming environment takes care of all the hard and boring parts and provides a simple, lightweight syntax. It’s a visualization “sketching” system, tailor-made for the rapid expression of visual ideas in code.

No such programming environment exists if you want to do visualizations of the text content of large document sets. First, you have to extract some sort of meaning from the language. Natural language processing has a long history and is advancing rapidly, but the available toolkits still require a huge amount of specialist knowledge and programming skill.

Big data also requires many computers running in parallel, and while there are now wonderful components such as distributed NoSQL stores and the Hadoop map-reduce framework, it’s a lot of work to assemble all the pieces. The current state of the art simply doesn’t lend itself to experimentation. I’d love for people with modest technical ability to be able to play around with document set visualizations, but we don’t have the right tools.

This is the hole that we’d like Overview to fill. There are certain key problems, such as email visualization, that we know Overview has to solve. But we’d like to solve them by building a sort of text visualization programming system. The idea is to provide basic text processing operations as building blocks, letting the user assemble them into algorithms. It should be easy to recreate classic techniques, or invent new ones by trial and error. The distributed storage and data flow should be handled automatically behind the scenes, as much as possible.
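To give a feel for the building-block idea, here is a toy sketch in Python in which a few small text operations are chained into different "algorithms." Every name in it (`tokenize`, `drop_stopwords`, `bigrams`, `pipeline`) is invented for this illustration; it is not Overview's design, and it leaves out the distributed storage and data flow entirely.

```python
import re
from collections import Counter
from functools import reduce


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def drop_stopwords(tokens, stopwords=frozenset({"the", "a", "of", "and", "to", "in"})):
    """Remove very common words (a tiny illustrative stopword list)."""
    return [token for token in tokens if token not in stopwords]


def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))


def count(items):
    """Tally how often each item occurs."""
    return Counter(items)


def pipeline(*steps):
    """Compose steps left to right into a single callable."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


# Two different "algorithms" assembled from the same blocks.
word_counts = pipeline(tokenize, drop_stopwords, count)
phrase_counts = pipeline(tokenize, bigrams, count)

text = "The convoy left the base and the convoy returned to the base"
print(word_counts(text).most_common(2))
print(phrase_counts(text).most_common(1))
```

Swapping one block for another, or reordering them, is a one-line change, which is the kind of trial and error we would like the real system to make cheap.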

That’s an ambitious project, and we are going to have to scale it down. Perhaps the first version of Overview won’t be as expressive or efficient as we’d like; we are explicitly prioritizing useful solutions to real problems over elegant tools that can’t be used for actual analysis. By the end of our Knight Foundation grant, Overview has to solve at least one difficult and essential problem in data journalism.

But ultimately, what we intend to build is a sketching system for visualizing the content and meaning of large collections of text documents — big text, as opposed to big data. Just as the Processing language has been a great enabler of the DIY visualization community, we hope that Overview will give interested folks a simple way to play with lots of different text processing techniques — and that we’ll all learn some interesting things from mining our ever-increasing store of public documents.

This post was originally published at PBS IdeaLab.

Overview is hiring!

We need two Java or Scala developers to build the core analytics and visualization components of Overview, and lead the open-source development community. You’ll work in the newsroom at AP’s global headquarters in New York, which will give you plenty of exposure to the very real problems of large document sets.

The exact responsibilities will depend on who we hire, but we imagine that one of these positions will be more focused on user experience and process design, while the other will do the computer science heavy lifting — though both must be strong, productive software engineers. Core algorithms must run on a distributed cluster, and scale to millions of documents. Visualization will be through high-performance OpenGL. And it all has to be simple and obvious for a reporter on deadline who has no time to fight technology. You will be expected to implement complex algorithms from academic references, and expand prototype techniques into a production application.

You will work closely with investigative reporters on real stories, ensuring that the developing application serves their real world document-dump reporting needs. You will also work with visualization experts and other specialists from across industry and academia, and act as the technical lead for the open-source development and user communities.

We can offer competitive salaries for this two-year contract. Please send your resume to


  • demonstrated ability to design and ship a large application with a clean, minimal, functional user interface
  • BSc. in CS, EE, or equivalent familiarity with computer science theory
  • mathematical ability, especially statistical models and linear algebra
  • 5 years experience as a Java software developer
  • familiarity with distributed open source development projects
  • experience in computer graphics and distributed systems a plus



Investigating thousands (or millions) of documents by visualizing clusters

This is a recording of my talk at the NICAR (National Institute of Computer-Assisted Reporting) conference in February 2011, where I discussed some of our recent work at the AP with the Iraq and Afghanistan war logs.

References cited in the talk:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs.)
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.

What can we accomplish in two years?

Overview is an ambitious project. The prototype workflow is based on automatically clustering documents by analyzing patterns of word usage, and our results so far are very promising. But that doesn’t immediately mean that this is the direction that development should take. There is a whole universe of document and data set problems facing journalists today, and a wide range of computational linguistics, visualization, and statistical methods we could try to apply. The space of possibility is huge.

But we can already say a few things about what must be accomplished for the project to be considered a success. We’re thinking not only about what must be accomplished by the end of the two-year grant, but how we’d like the project to evolve long after that. Within the space of our two-year grant, we think we need to do the following things:

  1. Build an active community of developers and users
  2. Release production software that works well for common problems
  3. Develop a core, scalable architecture that vastly increases the pace of R&D

We need your help on each of these goals, in different ways.

A full-text visualization of the Iraq war logs

This is a description of some of the proof-of-concept work that led to the Overview prototype, originally posted elsewhere.

Last month, my colleague Julian Burgess and I took a shot at peering into the Iraq War Logs by visualizing them in bulk, as opposed to using keyword searches in an attempt to figure out which of the 391,832 SIGACT reports we should be reading. Other people have created visualizations of this unique document set, such as plots of the incident locations on a map of Iraq, and graphs of monthly casualties. We wanted to go a step further, by designing a visualization based on the richest part of each report: the free text summary, where a real human describes what happened, in jargon-inflected English.

Also, we wanted to investigate more general visualization techniques. At The Associated Press we get huge document dumps on a weekly or sometimes daily basis. It’s not unusual to get 10,000 pages from a FOIA request — emails, court records, meeting minutes, and many other types of documents, most of which don’t have latitude and longitude that can be plotted on a map. And all of us are increasingly flooded by large document sets released under government transparency initiatives. Such huge files are far too large to read, so they’re only as useful as our tools to access them. But how do you visualize a random bunch of documents?

We’ve found at least one technique that yields interesting results, a graph visualization where each document is a node, and edges between them are weighted using cosine-similarity on TF-IDF vectors. I’ll explain exactly what that is and how to interpret it in a moment. But first, the journalism. We learned some things about the Iraq war. That’s one sense in which our experiment was a success; the other valuable lesson is that there is a boatload of research-grade visual analytics techniques just waiting to be applied to journalism.

Interpreting the Iraq War, December 2006
This is a picture of the 11,616 SIGACT (“significant action”) reports from December 2006, the bloodiest month of the war. Each report is a dot. Each dot is labelled by the three most “characteristic” words in that report. Documents that are “similar” have edges drawn between them. The location of the dot is abstract, and has nothing to do with geography. Instead, dots with edges between them are pulled closer together. This produces a series of clusters, which are labelled by the words that are most “characteristic” of the reports in that cluster. I’ll explain precisely what “similar” and “characteristic” mean later, but that’s the intuition.
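For readers who want the intuition in code, here is a minimal sketch in Python of the general idea: TF-IDF vectors, cosine similarity, and an edge between any two documents whose similarity clears a threshold. It rests on several assumptions that are not from the original analysis: "characteristic" is taken to mean "highest TF-IDF weight," the weighting is plain tf × log(N/df), the threshold value is arbitrary, and the three short reports are invented toy strings rather than real SIGACT text.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: weight} dict using tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def top_terms(vector, k=3):
    """The k highest-weighted terms: one possible reading of 'characteristic' words."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def similarity_edges(vectors, threshold=0.05):
    """Return (i, j, similarity) for every document pair at or above the threshold."""
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                edges.append((i, j, s))
    return edges


# Invented toy reports, not taken from the war logs.
docs = [
    "ied found and cleared on route".split(),
    "ied detonated on route near patrol".split(),
    "detainee transferred to police custody".split(),
]
vectors = tfidf_vectors(docs)
for i, vector in enumerate(vectors):
    print(i, top_terms(vector))
print(similarity_edges(vectors))
```

Note that comparing every pair of documents grows quadratically with the size of the set, so a sketch like this is only practical for small collections; a layout program would then pull connected dots together to form the clusters.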
