3 Difficult Document-Mining Problems that Overview Wants to Solve

The Overview project is an attempt to create a general-purpose document set exploration system for journalists. But that’s a pretty vague description. To focus the project, it’s important to have a set of test cases — real-world problems that we can use to evaluate our developing system.

In many ways, the test cases define the problem. They give us concrete goals, and a way to understand how well or poorly we are achieving those goals. These tests should be diverse enough to be representative of the problems that journalists face when reporting on document sets, and challenging enough to push us to innovate. There’s also value in using material that is already well-studied, so we can compare the results using Overview to what we’ve already learned using other techniques.

With that in mind, we’ve been scouring the document set lore and the AP’s own archives to find good test data. Here are three types of problems we’d like Overview to address, and some document sets that provide good examples of each.

A large set of structured documents — the Wikileaks files
Wikileaks published the Afghanistan and Iraq war logs data sets last year, and recently the full archive of U.S. diplomatic cables has also become available. All three archives are the same basic type: hundreds of thousands of documents in identical format.

Each document has the same set of pre-defined fields, such as date, location, incident type, originating embassy, etc. But this isn’t just a series of fill-in-the-blank forms, because each document also includes a main text field that is written in plain English (well, English with a lot of jargon). We call these types of documents “semi-structured,” and part of the analysis work here is understanding the relationship between the free-form text and the structured fields.

For example, our previous visualizations of the war logs use the topics discussed in the text to cluster the dots that represent each document, but the color is from the “incident type” field: red for “explosive hazard,” light blue for “enemy action,” dark blue for “criminal event,” and so on. The human eye can interpret color and shapes at the same time, so this allows us to literally see the relationship between topics and incident types.


There are lots of other large, homogeneous, semi-structured document sets of interest to journalists. Corporate filings are a prime example, but we might also want to analyze legislative records (as the AP did to learn how “9/11” was invoked in the U.S. Congress over the last 10 years), or the police reports of a particular city.

The key feature of this type of document set is that all the documents are the same type, in the same format, and there are a lot of them. The Wikileaks war logs and cables are a good specific test because they are widely available and already well-studied, so we can see whether Overview helps us see stories that we already know are there.

Communications records — the Enron emails
Federal investigators released a large set of internal emails after the spectacular collapse of the Enron corporation in 2001. The Enron corpus contains more than 600,000 emails written by 158 different people within the company. It has been widely used both to study this specific case of corporate wrongdoing and to explore broader principles and techniques in social network analysis.

The simplest way to visualize a huge pile of emails is to plot each email address as a node and draw edges when one person emailed another. That produces a plot of the social network of communicators, such as this one from Stanford University assistant professor Jeffrey Heer’s Exploring Enron project:


But there are other ways to understand this data set. For example, this plot excludes the element of time. Perhaps a group of conspirators gradually stopped talking to outsiders, or maybe power shifted from one branch of the company to another over time. These sorts of questions are addressed by dynamic network analysis. You could also ignore the social network completely and try to plot the threads of conversation, where one message refers back to an earlier one by someone else, as IBM’s thread arc project did.
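To make the node-and-edge idea and the time dimension concrete, here is a minimal sketch in Python. It counts messages along each sender-to-recipient edge, then repeats the count per calendar month so that shifts in who talks to whom become visible. The sample records, the example.com addresses, and the function names build_network and network_by_month are all made up for illustration; this is not how any of the projects mentioned above were implemented.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical toy records: (date sent, sender, list of recipients).
emails = [
    (date(2001, 1, 5), "alice@example.com", ["bob@example.com"]),
    (date(2001, 1, 9), "alice@example.com", ["bob@example.com", "carol@example.com"]),
    (date(2001, 2, 2), "bob@example.com", ["alice@example.com"]),
]


def build_network(records):
    """Count messages along each directed (sender, recipient) edge."""
    edges = Counter()
    for _sent, sender, recipients in records:
        for recipient in recipients:
            edges[(sender, recipient)] += 1
    return edges


def network_by_month(records):
    """Build one edge-count network per (year, month), in date order."""
    buckets = defaultdict(list)
    for record in records:
        sent = record[0]
        buckets[(sent.year, sent.month)].append(record)
    return {month: build_network(recs) for month, recs in sorted(buckets.items())}


if __name__ == "__main__":
    for month, edges in network_by_month(emails).items():
        print(month, dict(edges))
```

Comparing the monthly networks side by side is only the crudest form of dynamic network analysis, but it is enough to spot an edge that appears or disappears.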

Email dumps are increasingly common, especially with the recent uptick of hacking by collectives such as Anonymous and Lulzsec. But the concepts and tools used to analyze email can be applied to a broader category: any record of communications between a set of people. These could be emails, IM transcripts, Facebook messages, or a large set of Twitter traffic. To be useful for this type of analysis each record must contain at least the date, the sender, the recipient(s), and the message itself. There might also be things like subject lines or references to previous messages, which can be very useful in tracing the evolution of a conversation.
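As a rough illustration of that minimum record, the sketch below defines one possible Python structure with the required fields (date, sender, recipients, message) plus the optional subject line and reply reference. It also shows how a reply reference lets us walk a conversation back to its first message. The class name Message, its field names, and the helper thread_of are assumptions made for this example, not a format that Overview defines.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Message:
    # Required for communications analysis.
    message_id: str
    sent: datetime
    sender: str
    recipients: List[str]
    body: str
    # Optional, but useful for tracing how a conversation evolved.
    subject: str = ""
    in_reply_to: Optional[str] = None


def thread_of(message_id, messages):
    """Follow reply links back to the root and return the ids oldest-first."""
    by_id = {m.message_id: m for m in messages}
    chain = []
    seen = set()  # guards against malformed data with circular references
    current = message_id
    while current is not None and current in by_id and current not in seen:
        seen.add(current)
        chain.append(current)
        current = by_id[current].in_reply_to
    return list(reversed(chain))
```

The same structure would hold an IM transcript line or a tweet just as well as an email, which is the point of treating these as one category.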

Messy document dumps — the BP oil spill records
Freedom of Information laws don’t require governments to organize the documents they give back. In August of last year, the AP asked several U.S. federal agencies for all documents relating to the production of the report “BP Deepwater Horizon Oil Budget: What Happened to the Oil?” And we got them, in a 7,000-page PDF file. There are early drafts of the report, meeting minutes, email threads, internal reports, spreadsheets … The first step in mass analysis of this material is simply sorting it into categories.


Document classification algorithms can be used to automate this process, by scanning the text of each page and determining if it’s an email, a spreadsheet, or some other type of document. Then we can proceed with specialized visualization of each of these types of documents. For example, we could visualize the social network of the extracted emails.
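To give a feel for the sorting step, here is a deliberately simple Python sketch that guesses a page's type from surface cues: email header lines, rows made up mostly of numbers, and a few meeting-related words. This stands in for a real classification algorithm, which would normally be trained on labelled examples. The category names, the cue lists, and the 50 percent thresholds are all assumptions chosen for illustration.

```python
import re

EMAIL_HEADER = re.compile(r"^(From|To|Cc|Subject|Sent|Date):", re.IGNORECASE)
NUMBER = re.compile(r"[-+$]?[\d,]*\.?\d+%?")
MINUTES_WORDS = re.compile(r"\b(minutes|attendees|agenda)\b", re.IGNORECASE)


def looks_tabular(line):
    """True if the line has at least two tokens and half or more are numeric."""
    tokens = line.split()
    if len(tokens) < 2:
        return False
    numeric = sum(bool(NUMBER.fullmatch(token)) for token in tokens)
    return numeric / len(tokens) >= 0.5


def classify_page(text):
    """Guess the document type of one page of extracted text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "other"
    # Two or more header-style lines strongly suggest an email.
    if sum(bool(EMAIL_HEADER.match(line)) for line in lines) >= 2:
        return "email"
    # A page where most lines are rows of numbers is probably a spreadsheet.
    if sum(looks_tabular(line) for line in lines) / len(lines) >= 0.5:
        return "spreadsheet"
    if MINUTES_WORDS.search(text):
        return "meeting minutes"
    return "other"
```

Rules like these break quickly on real scans, but they show the shape of the task: text in, one label out, and then each pile goes to its own specialized tool.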

This sorting process isn’t itself a visualization, because the output is several different piles of sorted documents, not a picture. But it’s an extremely important task, because a huge part of the work in any data journalism project is just getting everything in the right format and ready for the real analysis. Although Overview is designed for visualization, it needs to include powerful tools for data preparation and cleanup.

The Wikileaks and Enron test cases involve a large collection of identically formatted documents. The BP oil spill documents are different, because they’re anything but homogeneous. This is an important test case because it represents a problem that comes up often in journalism, especially when we want to understand what we got back from a big Freedom of Information request.

Anything else?
If Overview could help with just these three problems, it would be an extremely valuable tool for journalists. But we need to make sure they’re the right problems. Are you trying to report on a large set of documents that isn’t anything like these cases? Please let us know!

A visualization sketching system

Over the last year, my colleagues and I at The Associated Press have been exploring visualizations of very large collections of documents. We’re trying to solve a pressing problem: We have far more text than hours to read it. Sometimes a single Freedom of Information request will produce a thousand pages, to say nothing of the increasingly common WikiLeaks-sized dumps of hundreds of thousands of documents, or huge databases of public documents.

Because reading every word is impossible, a large data set is only as good as the tools we use to access it. Search can help us find what we’re looking for, but only if we know what we are looking for. Instead, we’ve been trying to make “maps” of large data sets, visualizations of the topics or locations or the interconnections between people, dates, and places. We’ve had a few notable successes, such as our visualization of the Iraq war logs.

But frankly, this has been a slow process, because the tools for large-scale text analysis are terrible. Existing programs break when faced with more than a few thousand documents. More powerful software exists, but only in component form. It requires lots of programming to get a useful result.

Meanwhile, DIY visualization thrives. At the Eyeo festival in Minneapolis this summer, I was overwhelmed by the vibrant community that has formed around data visualization. Several hundred people sat in a room and listened raptly to talks by data artist Jer Thorp, social justice visualizer Laura Kurgan, the measurement-obsessed Nick Felton, and many others. Suddenly, a great many people are enthusiastically making images from code and data.

The weapon of choice for this community is Processing, a language designed specifically for interactive graphics by Ben Fry and Casey Reas (both of whom were at Eyeo). Creative communities thrive on good tools; think of Instagram, Instructables, or Wikipedia.

We want Overview to be the creative tool for people who want to explore text visualization — “investigative journalists and other curious people,” as our grant application put it.

The algorithms that our prototypes use are old by tech standards, dating mostly from information retrieval research in the ’80s. But then, the algorithms that the resurgent visualization community is implementing in Processing are mostly old, too; I coded many of them in C++ in the early 1990s when I was learning computer graphics programming. Today, one doesn’t have to learn C++ to make pictures with algorithms. The Processing programming environment takes care of all the hard and boring parts and provides a simple, lightweight syntax. It’s a visualization “sketching” system, tailor-made for the rapid expression of visual ideas in code.

No such programming environment exists if you want to do visualizations of the text content of large document sets. First, you have to extract some sort of meaning from the language. Natural language processing has a long history and is advancing rapidly, but the available toolkits still require a huge amount of specialist knowledge and programming skill.

Big data also requires many computers running in parallel, and while there are now wonderful components such as distributed NoSQL stores and the Hadoop map-reduce framework, it’s a lot of work to assemble all the pieces. The current state of the art simply doesn’t lend itself to experimentation. I’d love for people with modest technical ability to be able to play around with document set visualizations, but we don’t have the right tools.

This is the hole that we’d like Overview to fill. There are certain key problems, such as email visualization, that we know Overview has to solve. But we’d like to solve them by building a sort of text visualization programming system. The idea is to provide basic text processing operations as building blocks, letting the user assemble them into algorithms. It should be easy to recreate classic techniques, or invent new ones by trial and error. The distributed storage and data flow should be handled automatically behind the scenes, as much as possible.
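Here is one way the building-block idea might look, sketched in Python. Each text operation is a small function, and a pipeline helper chains them so that swapping or reordering steps is a one-line change. The names tokenize, remove_stopwords, count_terms, and pipeline are invented for this example; they are not Overview's design, which does not exist yet.

```python
from collections import Counter
from functools import reduce


def tokenize(text):
    """Split text into lowercase words on whitespace."""
    return text.lower().split()


def remove_stopwords(stopwords):
    """Return a step that drops the given words from a token list."""
    def step(tokens):
        return [token for token in tokens if token not in stopwords]
    return step


def count_terms(tokens):
    """Turn a token list into a bag of words (term -> count)."""
    return Counter(tokens)


def pipeline(*steps):
    """Chain steps left to right: the output of one feeds the next."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


word_counts = pipeline(tokenize, remove_stopwords({"the", "of"}), count_terms)

if __name__ == "__main__":
    print(word_counts("The fate of the oil"))
```

A real system would also have to decide where each step runs and how data moves between machines, which is exactly the part we would like to hide from the user.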

That’s an ambitious project, and we are going to have to scale it down. Perhaps the first version of Overview won’t be as expressive or efficient as we’d like; we are explicitly prioritizing useful solutions to real problems over elegant tools that can’t be used for actual analysis. By the end of our Knight Foundation grant, Overview has to solve at least one difficult and essential problem in data journalism.

But ultimately, what we intend to build is a sketching system for visualizing the content and meaning of large collections of text documents — big text, as opposed to big data. Just as the Processing language has been a great enabler of the DIY visualization community, we hope that Overview will give interested folks a simple way to play with lots of different text processing techniques — and that we’ll all learn some interesting things from mining our ever-increasing store of public documents.

This post was originally published at PBS IdeaLab.

Overview is hiring!

We need two Java or Scala developers to build the core analytics and visualization components of Overview, and lead the open-source development community. You’ll work in the newsroom at AP’s global headquarters in New York, which will give you plenty of exposure to the very real problems of large document sets.

The exact responsibilities will depend on who we hire, but we imagine that one of these positions will be more focused on user experience and process design, while the other will do the computer science heavy lifting — though both must be strong, productive software engineers. Core algorithms must run on a distributed cluster, and scale to millions of documents. Visualization will be through high-performance OpenGL. And it all has to be simple and obvious for a reporter on deadline who has no time to fight technology. You will be expected to implement complex algorithms from academic references, and expand prototype techniques into a production application.

You will work closely with investigative reporters on real stories, ensuring that the developing application serves their real-world document-dump reporting needs. You will also work with visualization experts and other specialists from across industry and academia, and act as the technical lead for the open-source development and user communities.

We can offer competitive salaries for this two-year contract. Please send your resume to jstray@ap.org.


  • demonstrated ability to design and ship a large application with a clean, minimal, functional user interface
  • BSc. in CS, EE, or equivalent familiarity with computer science theory
  • mathematical ability, especially statistical models and linear algebra
  • 5 years’ experience as a Java software developer
  • familiarity with distributed open source development projects
  • experience in computer graphics and distributed systems a plus



Investigating thousands (or millions) of documents by visualizing clusters

This is a recording of my talk at the NICAR (National Institute for Computer-Assisted Reporting) conference in February 2011, where I discussed some of our recent work at the AP with the Iraq and Afghanistan war logs.

References cited in the talk:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs.)
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
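For readers who want to see the three basic techniques from the list above in one place, here is a small standard-library Python sketch: each document becomes a bag of words, the counts are reweighted by tf-idf, and two documents are compared by the cosine of the angle between their vectors. It uses one common tf-idf variant (term frequency times the log of inverse document frequency); the exact weighting in our own work may differ, and the sample sentences are invented.

```python
import math
from collections import Counter


def tokenize(text):
    """Bag-of-words tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    bags = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(bags)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zero)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "ied found on route",
    "ied detonated on route",
    "detainee transferred to base",
]
vectors = tfidf_vectors(docs)
```

Note that with this weighting a word that appears in every document gets a weight of zero, which is the intended effect: words everyone uses tell you nothing about what makes one document different from another.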

What can we accomplish in two years?

Overview is an ambitious project. The prototype workflow is based on automatically clustering documents by analyzing patterns of word usage, and our results so far are very promising. But that doesn’t immediately mean that this is the direction that development should take. There is a whole universe of document and data set problems facing journalists today, and a wide range of computational linguistics, visualization, and statistical methods we could try to apply. The space of possibility is huge.

But we can already say a few things about what must be accomplished for the project to be considered a success. We’re thinking not only about what must be accomplished by the end of the two-year grant, but how we’d like the project to evolve long after that. Within the space of our two-year grant, we think we need to do the following things:

  1. Build an active community of developers and users
  2. Release production software that works well for common problems
  3. Develop a core, scalable architecture that vastly increases the pace of R&D

We need your help on each of these goals, in different ways.

A full-text visualization of the Iraq war logs

This is a description of some of the proof-of-concept work that led to the Overview prototype, originally posted elsewhere.

Last month, my colleague Julian Burgess and I took a shot at peering into the Iraq War Logs by visualizing them in bulk, as opposed to using keyword searches in an attempt to figure out which of the 391,832 SIGACT reports we should be reading. Other people have created visualizations of this unique document set, such as plots of the incident locations on a map of Iraq, and graphs of monthly casualties. We wanted to go a step further, by designing a visualization based on the richest part of each report: the free text summary, where a real human describes what happened, in jargon-inflected English.

Also, we wanted to investigate more general visualization techniques. At The Associated Press we get huge document dumps on a weekly or sometimes daily basis. It’s not unusual to get 10,000 pages from a FOIA request — emails, court records, meeting minutes, and many other types of documents, most of which don’t have latitude and longitude that can be plotted on a map. And all of us are increasingly flooded by large document sets released under government transparency initiatives. Such huge files are far too large to read, so they’re only as useful as our tools to access them. But how do you visualize a random bunch of documents?

We’ve found at least one technique that yields interesting results, a graph visualization where each document is a node, and edges between them are weighted using cosine-similarity on TF-IDF vectors. I’ll explain exactly what that is and how to interpret it in a moment. But first, the journalism. We learned some things about the Iraq war. That’s one sense in which our experiment was a success; the other valuable lesson is that there are a boatload of research-grade visual analytics techniques just waiting to be applied to journalism.
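As a rough sketch of how such a graph could be assembled, the Python below takes TF-IDF vectors that have already been computed (here, three tiny made-up ones) and emits a weighted edge for every pair of documents whose cosine similarity clears a cutoff. The function names and the 0.2 threshold are assumptions for illustration, not the values we used.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(vectors, threshold=0.2):
    """Return (i, j, weight) for each document pair at or above the threshold."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(vectors), 2):
        weight = cosine(a, b)
        if weight >= threshold:
            edges.append((i, j, weight))
    return edges


# Hypothetical, hand-written TF-IDF vectors for three documents.
toy_vectors = [
    {"ied": 0.5, "route": 0.5},
    {"ied": 0.5, "convoy": 0.5},
    {"detainee": 1.0},
]

if __name__ == "__main__":
    print(similarity_edges(toy_vectors))
```

This compares every document with every other one, so the work grows with the square of the number of documents. That is tolerable for one month of reports but is one reason the approach needs rethinking for the full archive.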


Interpreting the Iraq War, December 2006
This is a picture of the 11,616 SIGACT (“significant action”) reports from December 2006, the bloodiest month of the war. Each report is a dot. Each dot is labelled by the three most “characteristic” words in that report. Documents that are “similar” have edges drawn between them. The location of the dot is abstract, and has nothing to do with geography. Instead, dots with edges between them are pulled closer together. This produces a series of clusters, which are labelled by the words that are most “characteristic” of the reports in that cluster. I’ll explain precisely what “similar” and “characteristic” mean later, but that’s the intuition.
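One plausible way to pick labels like these, assuming that “characteristic” means “highest TF-IDF weight,” is sketched below in Python. A document's label is its three heaviest terms, and a cluster's label comes from summing the vectors of its members first. The function names, the tie-breaking rule, and the sample weights are invented for this example.

```python
from collections import Counter


def top_terms(vector, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _weight in ranked[:k]]


def cluster_label(vectors, k=3):
    """Sum the term weights across a cluster, then take the top k terms."""
    total = Counter()
    for vector in vectors:
        for term, weight in vector.items():
            total[term] += weight
    return top_terms(total, k)


# Hypothetical TF-IDF weights for two reports in the same cluster.
report_a = {"ied": 0.9, "route": 0.4, "found": 0.3, "on": 0.05}
report_b = {"ied": 0.5, "convoy": 0.6, "route": 0.1}

if __name__ == "__main__":
    print(top_terms(report_a))
    print(cluster_label([report_a, report_b]))
```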
