How Overview can organize thousands of documents for a reporter

Before computers, all document-driven stories started with a big stack of paper. Often, the first task was to organize all that paper, by sorting individual documents into piles by type. This gives journalists a high-level idea of “what’s in there” and helps them decide what to read more closely — and just as importantly, what isn’t worth reading.

Today a computer can organize your documents for you. That stack of paper may now be a folder full of PDF files, but it still doesn’t come with any sort of built-in index or obvious categorization system. This is exactly the problem that Overview solves: It splits documents into piles based on their subjects, and then splits each pile into even more specific sub-piles, and so on. The result is a tree of folders.

Above is part of the folder tree that Overview automatically built for the 6,849 documents containing every mention of the city of “Caracas” within the diplomatic cables released by Wikileaks. Overview labels each folder with the key words of the documents inside. The top folder here has words like PDVSA (the Venezuelan state oil company), “oil,” “billion,” “company,” and “production,” so it’s mainly documents concerning the oil industry and other big business. Other top-level folders in this document set (not shown) concern embassy politics, elections, and military operations.

The top-level folder about oil splits into two sub-folders. The one on the left concerns the oil industry specifically, while the one on the right is more about banks and finance. The oil industry folder splits further into regional issues (the Petrocaribe consortium) and documents about PDVSA specifically. Each folder splits into smaller and smaller sub-folders, each of which contains a smaller number of documents on a more specific topic. To let you know when the documents in a folder are getting very specific, Overview tells you when “MOST” or “ALL” of the documents in that folder contain a particular word.

How a computer understands topics

When I show this to reporters, their first question is always: how does the computer do that? It’s more than curiosity: if you’re going to rely on a computer to organize your documents, you’re asking a machine to help you decide what you should and shouldn’t read. The integrity of the reporting process demands that we understand what our algorithms are doing.

All document categorization algorithms are based on the ability to compare two documents to tell how similar they are. A group of documents which are all very similar to one another belong in the same folder. Computers don’t understand human language, so they need a simple mechanical process which takes two documents as input — literally just the sequence of words that make up the text of each document — and generates a number which is small if the documents are very different, and large if the documents concern the same topic.

Some text analysis systems, such as Open Calais, are based on “named entity recognition,” which extracts people, places, organizations, dates, etc. from the documents. Then, we can say that two documents are similar if they talk about the same entities. This is useful, but such systems will miss important generic words like “oil” and “production.” Instead, Overview examines every word of every document. In a sense, it reads the full text, so you don’t have to.

Comparing two documents based on their full text

Suppose we have filed an FOIA request for a classified storybook for the children of CIA operatives, and after a long legal battle, the government has given us copies of these three secret documents:

  • “The cat sat on the mat. Then the cat chased the rat.”
  • “The cat slept all day on the mat.”
  • “The rat ran across the floor.”

First, Overview strips capitalization and punctuation, and removes grammar words such as “the,” “a,” and “on.” These words, called stop words in natural language processing, aren’t useful for determining the topic of a text, because they appear in almost every document. This leaves us with:

  • “cat sat mat cat chased rat”
  • “cat slept all day mat”
  • “rat ran across floor”

You can see that most of the sense of each document is still there, despite removing lots of words. Then, Overview counts how many times each word appears in each document, producing a word frequency table like this:

  word     doc 1   doc 2   doc 3
  cat        2       1       0
  sat        1       0       0
  mat        1       1       0
  chased     1       0       0
  rat        1       0       1
  slept      0       1       0
  all        0       1       0
  day        0       1       0
  ran        0       0       1
  across     0       0       1
  floor      0       0       1
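If you’re curious what these steps look like in code, here is a minimal sketch in Python. It is not Overview’s actual implementation, and the stop word list is a tiny illustrative subset:

    # Strip stop words, then count word frequencies per document.
    from collections import Counter
    import re

    STOP_WORDS = {"the", "a", "an", "on", "then", "of", "in", "and"}  # tiny subset

    def tokenize(text):
        """Lowercase, drop punctuation, and remove stop words."""
        words = re.findall(r"[a-z]+", text.lower())
        return [w for w in words if w not in STOP_WORDS]

    docs = [
        "The cat sat on the mat. Then the cat chased the rat.",
        "The cat slept all day on the mat.",
        "The rat ran across the floor.",
    ]

    frequencies = [Counter(tokenize(d)) for d in docs]
    print(frequencies[0])  # Counter({'cat': 2, 'sat': 1, 'mat': 1, 'chased': 1, 'rat': 1})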

This throws out the order of the words, which means the computer can’t tell the difference between “soldiers shot civilians” and “civilians shot soldiers.” That may seem hopelessly simplistic, but decades of information retrieval research show that word order usually doesn’t matter when all you want to know is the topic of a document.

Then Overview compares every pair of documents to check how similar they are. It does this by counting the number of words which appear in both documents, but with a twist: If a word appears twice in one document, it’s counted twice. In other words, we multiply the frequencies of corresponding words, then add up the results. This is the final similarity score.

In this case, the two documents about the cat have a similarity of 3: “cat” appears twice in the first document and once in the second, and “mat” appears once in each. The document about the rat has no words in common with the document about the cat sleeping on the mat, so that pair’s similarity score is zero.
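In code, this score is just a dot product of the two frequency tables. Continuing the sketch above:

    # Multiply the counts of each shared word, then sum the products.
    def similarity(freq_a, freq_b):
        # A Counter returns 0 for words it hasn't seen, so non-shared
        # words contribute nothing to the sum.
        return sum(count * freq_b[word] for word, count in freq_a.items())

    print(similarity(frequencies[0], frequencies[1]))  # 3  (cat: 2*1, plus mat: 1*1)
    print(similarity(frequencies[1], frequencies[2]))  # 0  (no words in common)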

Documents which are similar enough end up in the same folder, and the folder is labeled by the words which make those documents different from all the others. In this case, the folder is labeled by “cat” and “mat” because those words don’t appear in the remaining document about the rat.

And that’s the heart of it. This description omits a number of details for simplicity, but includes all of the things a reporter needs to know:

  • Overview uses the full text of each document.
  • It is not sensitive to word order.
  • Documents with overlapping words are placed in the same folder.

If you’d like to understand the process more deeply, here are a few more details. Overview actually processes text in two-word phrases (bigrams), not just single words, so it can detect people’s names and other short phrases. Rather than using raw term counts, it weights each word by how rare it is in the document set overall, using a classic formula called TF-IDF. And to generate the folders from these pairwise document similarities, Overview uses k-means clustering, splitting each folder recursively at every level of the tree.
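None of these refinements change the basic picture, and you can experiment with the same ingredients yourself. The sketch below uses Python’s scikit-learn library to combine bigrams, TF-IDF weighting, and k-means clustering; it is a rough approximation of this kind of pipeline, not Overview’s own code:

    # Unigrams and bigrams, TF-IDF weights, then k-means to form "folders".
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "The cat sat on the mat. Then the cat chased the rat.",
        "The cat slept all day on the mat.",
        "The rat ran across the floor.",
    ]

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    tfidf = vectorizer.fit_transform(docs)

    # Split the document set into two folders; a tree builder would repeat
    # this recursively inside each folder.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
    print(kmeans.labels_)  # e.g. [0 0 1]: the two cat documents end up together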

Try it on your own documents

Overview is available for free at overviewproject.org. It can automatically import your projects from the popular DocumentCloud repository, which also handles document upload, OCR, and other tasks. Or, you can upload a CSV file if your text is already in spreadsheet or database format. It also works great on social media data, such as a collection of tweets or blog posts.

You can learn to use Overview by watching a short video on the help page, or viewing the webinar recorded at Poynter’s NewsU.

Dealing with massive PDFs by splitting them into pages

Our users frequently face a situation where the natural document boundaries are lost. If we’re going to be analyzing 50,000 emails, ideally each email would be stored in its own file. But very often, the source material arrives as a series of massive PDF files, each of which may be thousands of pages long. Or the documents may arrive as a big stack of paper, which becomes a single massive PDF after scanning.

It can be challenging to recover the original document boundaries within a file where everything runs together. For my story on private security contractors in Iraq, I solved this problem with a custom script that tried to detect cover pages. This was a difficult and time-consuming solution.

Fortunately there is an easy trick that works well in most cases: split each long document into pages, and then have Overview sort the large number of pages rather than the small number of original documents. Overview can do this automatically when importing from PDF or DocumentCloud, if you tell it to treat each page as a document.
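Overview handles the splitting for you, but if you ever need to break up a huge PDF yourself, it takes only a few lines with a PDF library. Here is a sketch using the Python pypdf package; the file names are made up:

    # Split one big PDF into a separate file per page.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("big-document.pdf")  # hypothetical input file
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(f"page-{i + 1:04d}.pdf", "wb") as out:
            writer.write(out)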

AP reporter Jack Gillum was the first to suggest this trick, and used it when analyzing 9,000 pages of documents concerning then-Vice Presidential candidate Paul Ryan. In that case, there were many different kinds and lengths of documents within the huge stack of paper he received. Manual splitting was out of the question because it would have been much too time consuming, and there was no easy way to automate the task. Making Overview sort “pages” instead of “documents” was the simple solution, and it worked great.

This might seem counter-intuitive: how could it possibly work to ignore the boundaries of the original documents? But in fact “document” is a somewhat vague term. When we’re looking through a lot of material, we need to choose a “unit of analysis,” and the ideal unit might not be a “document,” especially if the documents are long. For example, we might want to analyze books at the level of chapters, or contracts at the level of paragraphs, or legislation at the level of sections. A “page” may not conform to any such natural division, but it’s a conveniently sized chunk of text to compare to other, similarly sized chunks. If you can imagine that it would be useful to break the source material into pages and then sort the pages by topic, then this splitting trick will work. It’s not perfect, but it’s simple, fast, and works on any type or length of document.

[Updated 2014-6-4 because page splitting now works for PDF uploads, not just DocumentCloud project import]

How to use Overview to analyze social media posts

Even when 10,000 people post about the same topic, they’re not saying 10,000 different things. People talking about an event will focus on different aspects of it, or have different reactions, but many people will be saying pretty much the same thing. People posting about a product might be talking about one or another of its features, the price, their experience using it, or how it compares to competitors. Citizens of a city might be concerned about many different things, but which things are the most important? Overview’s document sorting system groups similar posts so you can find these conversations quickly, and figure out how many people are saying what.

This post explains how to use Overview to quickly figure out what the different “conversations” are, how many people are involved in each, and how they overlap.

Step 1: Get the data into Overview

Overview does not capture or scrape social media. Fortunately, there are lots of monitoring tools that do, such as Radian 6, Sysomos, and Datasift, and Overview can read data exported from each of them as CSV files. CSV is a simple file format used by spreadsheet programs, so if you can open your data in Excel, you should be able to import it into Overview. Or, you can create your own files in this format, as sketched below.
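If you’re building the CSV yourself, any file with one post per row will do, as long as it has a column containing the document text. This sketch assumes Overview reads the text from a column named “text”; check Overview’s help for the exact column names it supports:

    # Write social media posts to a CSV file for upload, one post per row.
    # Assumes Overview will read the document text from a "text" column.
    import csv

    posts = [
        "Loved the keynote this morning!",
        "Still waiting in line. Not impressed.",
    ]

    with open("posts.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])  # header row
        for post in posts:
            writer.writerow([post])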

Then upload the file into Overview. Select “Import a New Document Set” as usual, then choose “CSV upload” and select your file. Overview will check the file for errors and show a preview of its contents.

If it all looks good, click Upload.

Step 2: Explore your documents

Overview reads the full text of each post — not just names and keywords — and groups similar documents together into different folders. Then it takes each folder, and groups similar documents into sub-folders, and so on. The folder tree shows how Overview has organized your posts.

There won’t always be exactly one type of “conversation” per folder, because conversation is a fuzzy concept, and Overview doesn’t know how you want to categorize things. But all the documents in a folder will be alike in some way. If a folder seems to include too many different types of posts, try exploring its sub-folders, which have narrower topics.

As you select each folder, Overview shows a list of the posts in that folder in the bottom left. Each post is summarized by the most “characteristic” words in that post — not the words that are most common, but the words that make the documents in that folder different from the documents in other folders.

It’s the tags you create that define which conversations you’re interested in. These might be very broad categories such as “positive” and “negative,” or much more specific tags such as “didn’t like the color.” Generally, you’ll explore the documents top-to-bottom and left-to-right, inventing and assigning tags as you go. You can assign tags to one document at a time if you like, but usually you’ll want to assign a tag to all documents in the folder at the same time. To do this, select the folder, then move the mouse over the tag and press the + button.

Tags are non-exclusive, meaning that each document can have as many tags as you like.

Step 3: Export your data

Now that you’ve seen what’s in your data and tagged it for future reference, you probably want to use these results in some way. You can see how many documents have each tag just by selecting that tag. Or, you can export your entire document set, including the tags you just added. The export button is on the document set list page. It creates a spreadsheet (CSV file) and gives you a choice of how to format your tags.

Each document will be one row in the spreadsheet. If you put all tags in one column, they will be separated by commas, like “Sunshine, Positive.” But putting each tag in its own column makes it easy to do things like find all the documents that have a specific tag, or take column totals to get the number of documents with each tag, which is useful for making a visualization of your results, like this visualization of 16,000 tweets about drones (created with Overview by Tempero UK).
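If you prefer code to a spreadsheet for those column totals, here is a small sketch using Python’s pandas library. The file name is made up, and it assumes the “each tag in its own column” export format, where a cell is non-blank when the document carries that tag:

    # Count how many exported documents carry each tag.
    import pandas as pd

    df = pd.read_csv("overview-export.csv")  # hypothetical export file name
    tag_columns = ["Sunshine", "Positive"]   # replace with your own tag names
    print(df[tag_columns].notna().sum())     # number of documents per tag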