Importing documents from a CSV file

There are different ways to get your documents into Overview. This post is about loading documents from a CSV file, and the format that Overview expects.

The quick answer

Overview expects all documents in a single CSV file, one document per row, plus a header row. Overview can read the following columns:

  • text — this is the only required column, and must contain the document text.
  • title — This is displayed when viewing the document. Documents are sorted by title in the document list.
  • url — If the URL begins with https Overview will display the page when viewing the document. Otherwise Overview will display the plain text and a source link.
  • id — ignored by Overview, but saved when the document set is exported, so you can match against other tables.
  • tags — a comma-separated list of tags applied to each document. Great for comparing text to data.
  • Every other column will appear as editable document metadata.

The “text” column must exist. (It may also be named “snippet” or “contents” for compatibility with files exported by Radian 6 and Sysomos). All other columns are optional.
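
As a rough illustration of the layout described above, here is a minimal Python sketch that writes such a file with the standard csv module. The column names come from the list above; the sample documents, the extra "author" column, and the function name write_overview_csv are made up for this example.

```python
import csv
import io

# Column names that Overview reads, per the list above.
# "author" is a hypothetical extra column; it would show up as document metadata.
FIELDNAMES = ["id", "title", "text", "url", "tags", "author"]

# Made-up documents, for illustration only.
documents = [
    {
        "id": "1",
        "title": "First memo",
        "text": "This is the content of the first document.",
        "url": "",
        "tags": "memo,2013",
        "author": "A. Reporter",
    },
    {
        "id": "2",
        "title": "Second memo",
        "text": "Document two talks about quick brown foxes, at length.",
        "url": "",
        "tags": "memo",
        "author": "B. Editor",
    },
]


def write_overview_csv(rows, out):
    """Write a header row, then one document per row."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_overview_csv(documents, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a string, the csv module documentation recommends opening the file with newline="" (an explicit encoding such as UTF-8 is also a good idea).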

Need more details? Read on.

Creating a CSV file

Many programs can save a CSV file.  For example, Excel can save a spreadsheet as a CSV file. Be sure to include the column names in the first row. You can also export a CSV file from a MySQL database.

Overview can probably read just about any CSV file created with any program, if you can ensure that the header is right. You may need to rename existing columns to the names that Overview expects, or you may need to add the header row entirely because some programs do not write a header. To do this, edit the CSV file with any text editor such as Notepad or TextEdit — but not Microsoft Word, as you will probably break the CSV file if you try to edit it in a word processor.
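
If a text editor is awkward (for example, on a very large file), the same header fix can be scripted. Here is a minimal sketch in Python, assuming hypothetical source column names "headline" and "body"; rename_header and add_header are illustrative helpers, not part of Overview.

```python
import csv
import io

# Hypothetical mapping from another program's column names to the names Overview reads.
RENAMES = {"body": "text", "headline": "title"}


def rename_header(src, dst, renames):
    """Copy CSV rows from src to dst, renaming columns in the first row only."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow([renames.get(name, name) for name in header])
    writer.writerows(reader)


def add_header(src, dst, header):
    """Copy CSV rows from src to dst, putting a header row in front."""
    writer = csv.writer(dst)
    writer.writerow(header)
    writer.writerows(csv.reader(src))


# A tiny in-memory file standing in for a real export.
original = io.StringIO('headline,body\r\nMemo,"Hello, world"\r\n')
fixed = io.StringIO()
rename_header(original, fixed, RENAMES)
print(fixed.getvalue())
```

Going through a CSV reader and writer, rather than editing the first line as raw text, keeps quoted values intact.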

What a CSV file for Overview should look like

A CSV file is simply a list of “comma-separated values,” organized into rows and columns, like a spreadsheet or a table. The file starts with a list of the column names, separated by commas. This is followed by each row of data, one row per line, with the values for each column again separated by commas. Overview only requires one column, which must be named “text.” Here is an example file:

text
This is the content of the first document.
And here is the text of document the second
Document three talks about quick brown foxes.

If the text of a document spans multiple lines, or itself contains commas, then it needs to be quoted. Quotes inside a quoted document must be “escaped” by turning them into double quotes. This is all standard CSV stuff, and any program or library that writes CSVs should do it automatically.
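
As one illustration of a library doing this automatically, here is a short sketch using Python's standard csv module. The document texts are made up; the point is only that the writer adds quotes where they are needed, doubles any quotes inside a quoted value, and that reading the file back recovers the original text.

```python
import csv
import io

# Made-up document texts covering the cases described above.
texts = [
    "Plain text on one line needs no quotes.",
    "This one has a comma, so it gets quoted.",
    'I\'d like to say "Hi!" to everyone.',
    "This one crosses lines.\n\nIt even has a blank line.",
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default dialect: quote only when needed
writer.writerow(["text"])    # header row
for text in texts:
    writer.writerow([text])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the file back recovers the original texts exactly.
rows = list(csv.reader(io.StringIO(csv_text)))
```

A hand-written file following the same rules looks like this: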

"This document is really long and crosses multiple lines and contains
commas, which is why it is quoted."
"This is the second document. I'd like to say ""Hi!"" to everyone
to show how to put quotes inside a quoted document. The text of
this document can cross as many lines as needed, or even contain
blank lines like this:

The second document ends with this final quote."
The third document fits all on one line so no quotes needed.
"The fourth document has a comma in it, so it's quoted too."

Overview will display the text in the viewer pane when you click on each document. If you want to display something else for the document, you can add a “url” column which tells Overview to load a particular web page when you view that document. For security reasons, this must be an https URL. Here’s an example using tweets:

New deploy today -- cleaner clustering, better handling of larger document sets. Anyone got a pile of PDFs they want to look at? Try it!,
"""“I’m not going to sit out on the newsroom floor and sort pages into stacks of documents"" ~@jackgillum on need for document mining software.",

There are three more columns that Overview can read. You can add a “title” to each document, which Overview displays in the document list. Documents are sorted by title, so this is a way to control the order in which documents are listed. You can add a unique ID column, simply named “id”, which Overview will read and associate with the document, and export with the document set. And you can add a comma-separated list of “tags” if you want to import documents with tags already applied, or want to compare text to data.

The columns can be in any order; all that matters is that the order of the column names matches the order of the data.

Every other column will appear as editable document metadata. When Overview exports a document set, it writes out id, text, title, url, and tags fields, as well as any other fields you imported, or created within Overview.

Uploading your CSV file to Overview

First, select the upload option from the document set list page:

Then choose a file. Overview will show a preview and do some basic checks to ensure that the format is OK. It should look like this:

After these checks you will see the usual import options, then a preview of the file contents. You may need to tell Overview what character encoding the file uses. Try changing this if you see funny square characters in the preview, or accents aren’t displaying right. Then hit upload, and away we go.
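
If the preview still looks wrong, another option is to convert the file to UTF-8 before uploading. This is only a sketch: it assumes you know (or can guess) the original encoding, here Windows-1252, and that UTF-8 is among the encodings Overview lets you choose. The helper names are hypothetical.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Stand-in for the bytes of a CSV file saved by a Windows program.
sample = "text\r\ncafé, naïve, résumé\r\n".encode("cp1252")

print(looks_like_utf8(sample))     # accented letters are not valid UTF-8 here
converted = to_utf8(sample, "cp1252")
print(looks_like_utf8(converted))
```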

The document mining Pulitzers

Four of the winners and finalists of the 2014 Pulitzer prize in journalism were based on reporting from a large volume of documents. Journalists have relied on bulk documents since long before computers — as in the groundbreaking work of I. F. Stone — but document-driven reporting has blown up over the last few years, a confluence of open data, better technology, and the era of big leaks.

What I find fascinating about these stories is that they show such different uses and workflows for document mining in journalism. Several of them suggest directions that Overview should go.

Overview used for Pulitzer finalist story

We were delighted to learn that Overview was used on the sole finalist in the Public Service category, an investigation into New York state secrecy laws that hide police misconduct. Adam Playford and Sandra Peddie of Newsday reported the series over many months, and Playford wrote about his use of Overview for Investigative Reporters and Editors:

We knew early in our investigation of Long Island police misconduct that police officers had committed dozens of disturbing offenses, ranging from cops who shot unarmed people to those who lied to frame the innocent. We also knew that New York state has some of the weakest oversight in the country.

What we didn’t know was if anyone had ever tried to change that. We suspected that the legislature, which reaps millions in contributions from law enforcement unions, hadn’t passed an attempt to rein in cops in years. But we needed to know for sure, and missing even one bill could change the story drastically.

Luckily, I’d been playing with Overview, a Knight Foundation-funded Associated Press project that highlights patterns within piles of documents. Overview simplified my task greatly — letting me do days’ worth of work in a few hours.

Almost instantly, Overview scanned the full text of all 1,700 bills and created a visualization that split the bills into dozens of groups based on the most unique words that appeared in each bill. This gave me an easy way to skim through the bills in each group by title.

Ultimately, Playford and Peddie were able to prove that lawmakers had never addressed the problem. This is a tremendous story, exactly the sort of classic accountability reporting that journalism is supposed to be about.

The document mining process is particularly interesting because the reporters needed to prove that something was not in the documents, which is not a task that we envisioned when we began building Overview. But because Overview clusters similar documents, they were able to complete an exhaustive review much more quickly because they could often discard obviously unrelated clusters.

The Snowden files: software is an obvious win

The winner in the Public Service category was a blockbuster: The Guardian and The Washington Post won for their ongoing coverage of the NSA documents. Snowden has never publicly said exactly how many documents he took with him, but The Guardian says it received 58,000 at one point. This story typifies a new breed of document-driven reporting that has emerged in the last few years: a large leak, already in electronic form, with very little guidance about what the stories might be.

Finding what you don’t know to look for is an especially tricky problem with such unasked-for archives — as opposed to the result from a large FOIA request, where the journalist knows why they asked for the documents. Clearly you need software of some sort to do reporting on this type of material.

The Al-Qaida papers: we need a workflow breakthrough

Rukmini Callimachi of the Associated Press was a finalist for her stories based on a cache of Al-Qaida documents found in the remote town of Timbuktu, Mali. Thousands of documents were found strewn through a building that the fighters had occupied for more than a year. Unlike the digital Snowden files these were paper materials, several trash bags full of them, which had to be painstakingly collected, cataloged, scanned, and translated.

Through these documents we’ve learned of Al-Qaida’s shifting strategy in Africa, their tips for avoiding drones, and that members must file expense reports.

We had a chance to talk through the reporting process with Callimachi, and we’ve come to the conclusion that the bottleneck in such “random pile of paper” stories is the preservation, scanning and translation process. The reporters on the similarly incredible Yanuleaks documents — thrown into a lake by ousted Ukrainian president Viktor Yanukovych — face similar challenges. I still believe in the power of good software to accelerate these types of stories, but we need a breakthrough in workflow and process, rather than language analysis algorithms. Could we integrate scanning and translation into a web app? Maybe using a phone scanning stand, and a combination of computer and crowdsourced translation?

America’s underground adoption market: counting cases

The final document-driven Pulitzer finalist covered a black market for adopted children. Megan Twohey of Reuters analyzed 5000 posts from a newsgroup where parents traded children adopted from abroad.

I’ve started calling this kind of analysis a categorize and count story. Reuters reporters created an infographic to go with their story, summarizing their discoveries: 261 different children were advertised over that time, from 34 different states. Over 70% of these children had been born abroad, in at least 23 different countries. They also documented the number of cases where children were described as having behavioral disorders or being victims of physical or sexual abuse.

When combined with narratives of specific cases, these figures tell a powerful story about the nature and scope of the issue.

5000 posts is a lot to review manually, but that’s exactly what Twohey and her collaborators did. Two reporters independently read each post and recorded the details in a database, including the name and age of the child, the email address of the post author, and other information. After a lengthy cleaning process they were able to piece together the story of each child, sometimes across many years and different poster pseudonyms.

There may be an opportunity here for information extraction or machine learning algorithms: given a post, extract the details such as the child’s name, and try to determine whether the text mentions other information such as previous abuse.

But no one has really tried applying machine learning to journalism in this way. We hope to add semi-automated document classification features to Overview later this year, because it’s a problem that we see reporters struggling with again and again.

We’ll see more of this

I’m going to end with a guess: we’ll see several document-driven stories in next year’s Pulitzers, because we’ll see many more such stories in journalism in general. I make this prediction by looking at several long term trends. Broadly speaking, the amount of data in the world is rapidly increasing. Open data initiatives offer a particular trove for journalists because they are often politically relevant, and governments across the world are getting better at responding to Freedom of Information Requests (even in China.) At the same time, we’ve entered the era of the mega-leak: an entire country’s state secrets can now fit on a USB drive.

Much of this flood is unstructured data: words instead of numbers. Some technologists argue that we can and should eventually turn all of this into carefully coded databases of events and assertions. While such structuring efforts can be very valuable, we can expect that unstructured text data will always be with us because of its unique flexibility. Emails, instant messages, open-ended survey questions, books: ordinary human language is the only format that seems capable of expressing the complete range of human experience, and that is what journalism is ultimately about.

Meanwhile the technology for reporting on bulk unstructured material is improving rapidly. Overview is a part of that, and we’re aiming specifically at users in journalism and other traditionally under-resourced social fields. Between greater data availability and better tools, I have to imagine that we’ll see a lot more document-driven journalism in the future. To my eye, this year’s Pulitzers are a reflection of larger trends.

[Updated 2014-5-6 with a more accurate description of the reporting process for the Reuters story.]



Overview's response to the Heartbleed security vulnerability

UPDATE: we have installed our new SSL certificates. If you are an Overview user, you should have received an email asking you to reset your password by clicking on the “reset it” link on the login form. Please reset your password! If you are concerned that someone may have gained unauthorized access to your documents, we can work with you to audit our server logs to see if anyone who wasn’t you used your password.

This completes Overview’s recovery from Heartbleed.

You may have heard that, a few days ago, a serious bug called Heartbleed was discovered in a piece of the software that powers much of the web, including Overview.

This bug could allow an attacker to intercept and decode secured connections to our server, and thereby gain access to your password and then your private documents. Due to the nature of this bug there is no way for us to know if any accounts have been compromised.

We have already upgraded our servers so they do not have this vulnerability. Unfortunately, if anyone compromised our secure connections previously they may still be able to do so. We are working with our provider to get new SSL certificates to fix this problem. We are told this will take a few days.

When this is done, we will send out a mass email asking everyone to reset their password.

We apologize for the inconvenience. It’s a breathtaking bug, and we and the rest of the web are recovering as fast as we can.




VIDEO: What the Overview Project does

Here is my talk from the wonderful Groundbreaking Journalism conference in Berlin last week, plus the panel afterwards. This is a great short introduction to what the Overview Project has done, and where we are going — we see ourselves as a pipeline from the AI research community to usable applications in the social sector.

My talk is 15 minutes, followed by a panel on “What software does journalism need?”

View the same documents in different ways with multiple trees

Starting today Overview supports multiple trees for each document set. That is, you can tell Overview to re-import your documents — or a subset of them — with different options, without uploading them again. You can use this to:

  • Focus on a subset of your documents, such as those with a particular tag or containing a specific word.
  • Use ignored and important words to try sorting your documents in different ways.

You create a new tree using the “New Tree” link above the tree:

This brings up a dialog box that looks very similar to the usual import options. You can name the tree (good for reminding yourself why you made it) and set ignored and important words to tell Overview how you want your documents organized in this tree. You can also include only those documents with a specific tag.

To create a tree that contains only documents matching a particular search term, first turn your search into a tag using the “create tag from search results” button next to the search box.

Tags are shared between all of the trees created from a document set. That means when you tag a document in one tree, it will be tagged in every other tree. You can try viewing your documents with different trees, tagging in whatever tree is easiest to use.

After you create a tree, you can get information about what you created by clicking the little (i) on the tab for that tree:


Who will bring AI to those who cannot pay?

One Sunday night in 2009, a man was stabbed to death in the Brentwood area of Long Island. Due to a recent policy change there was no detective on duty that night, and his body lay uncovered on the sidewalk until morning. Newsday journalist Adam Playford wanted to know if the Suffolk County legislature had ever addressed this event. He read through 7,000 pages of meeting transcripts and eventually found the council talking about it:

the incident in, I believe, the Brentwood area…

This line could not have been found through text search. It does not contain the word “police” or “body,” or the victim’s name or the date, and “Brentwood” matches too many other documents. Playford read the transcripts manually — it took weeks — because there was no other way available to him.

But there is another way, potentially much faster and cheaper. It’s possible for a computer to know that “the incident in Brentwood” refers to the killing, if it’s programmed with enough contextual information and sophisticated natural language reasoning algorithms. The necessary artificial intelligence (AI) technology now exists. IBM’s Watson system used these sorts of techniques to win at Jeopardy, playing against world champions in 2011.

Last month, IBM announced the creation of a new division dedicated to commercializing the technology they developed for Watson. They plan to sell to “healthcare, financial services, retail, travel and telecommunications.”

Journalism is not on this list. That’s understandable, because there is (comparatively speaking) no money in journalism. Yet there are journalists all over the world now confronted with enormous volumes of complex documents, from leaks and open government programs and freedom of information requests. And journalism is not alone. The Human Rights Data Analysis group is painstakingly coding millions of handwritten documents from the archives of the former Guatemalan national police. UN Global Pulse applies big data for humanitarian purposes, such as understanding the effects of sudden food price increases. The crisis mapping community is developing automated social media triage and verification systems, while international development workers are trying to understand patterns of funding by automatically classifying aid projects.

Who will serve these communities? There’s very little money in these applications; none of these projects can pay anywhere near what a hedge fund or a law firm or intelligence agency can. And it’s not just about money: these humanitarian fields have their own complex requirements, and a tool built for finding terrorists may not work well for finding stories. Our own work with journalists shows that there are significant domain-specific problems when applying natural language processing to reporting.

The good news is that many people are working on sophisticated software tools for journalism, development, and humanitarian needs. The bad news is that the problem of access can’t be solved by any piece of software. Technology is advancing constantly, as is the scale and complexity of the data problems that society faces. We need to figure out how to continue to transfer advanced techniques — like the natural language processing employed by Watson, which is well documented in public research papers — to the non-profit world.

We need organizations dedicated to continuous transfer of AI technology to these underserved sectors. I’m not saying that for-profit companies cannot do this; there may yet be a market solution, and in any case “non-profit” organizations can charge for services (as the Overview Project does for our consulting work.) But it is clear that the standard commercial model of technology development — such as IBM’s billion dollar investment in Watson — will largely ignore the unprofitable social uses of such technology.

We need a plan for sustainable technology transfer to journalism, development, academia, human rights, and other socially important fields, even when they don’t seem like good business opportunities.

Use "important" and "ignored" words to tell Overview how to file your documents

Overview automatically files documents by topic. But you know things that the computer doesn’t, like what’s important for your particular documents. Now you can tell Overview that certain words are important when you import your documents.

This works in combination with the ability to ignore unimportant words. Suppose you’re looking at the White House emails about drilling in the Gulf of Mexico (one of Overview’s example document sets) and you’re specifically interested in environmental topics. You can enter words like “environment” and “environmental” in the important words box, like this:

Here we’ve used the “words to ignore” feature to tell Overview to ignore the names of the two main email writers, because we don’t want to organize emails by who sent them, just by their contents. Then we’ve entered “environment” and “environmental” as important words to tell Overview that that’s what we want to look for. Note that we’ve also entered “Environment” and “Environmental” because the important words list is case-sensitive (ignored words are not case-sensitive.)

Overview throws out the ignored words, then gives extra weight to any of the important words it finds. Usually it ends up filing all the documents containing important words in their own folder, like this:

Overview doesn’t always put every document containing the important words into their own folder. If a document contains “environment” but is much more closely related to other documents which do not, it will be filed with them instead. (You can always search to see where Overview has filed documents containing a particular word.)

Also, each important word might or might not get its own folder. Overview doesn’t know “environment” and “environmental” have similar meanings, but it does see that the documents containing these words are similar, so it puts them together.

You can also use Java regular expressions to find important words. For example you can create a folder for each all-uppercase ACRONYM by using the expression [A-Z]+.  Even if you’re not using regular expressions, important words are case-sensitive. (This is to make it easier to find names, which are often capitalized. In the future we’ll add a check box to turn this on or off.)
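
To get a feel for what the expression [A-Z]+ picks up, here is a small sketch. It uses Python's re module rather than Java (a simple character class like this behaves the same in both), and the sample sentence is made up. How Overview applies the pattern to each word is not described here, so the sketch shows both possibilities: matching a whole word, and matching anywhere inside a word.

```python
import re

# The expression from the text: one or more consecutive uppercase ASCII letters.
ACRONYM = re.compile(r"[A-Z]+")

# Made-up sentence for illustration.
sample = "The EPA and MMS emails mention BP, not the Environment agency."

# Assumption 1: the pattern must match a whole word.
words = re.findall(r"[A-Za-z]+", sample)
whole_word_matches = [w for w in words if ACRONYM.fullmatch(w)]
print(whole_word_matches)

# Assumption 2: the pattern may match anywhere, which also picks up
# the capital letters that start ordinary words.
partial_matches = ACRONYM.findall(sample)
print(partial_matches)
```

If the second behavior turns out to be the one in play, a stricter expression (for example, one requiring at least two letters) may be a better fit.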

Taken together, ignored and important words are a powerful way to tell Overview how you want certain documents organized, while letting the computer make automatic decisions for the rest.

How big is a document dump?

When journalists end up with a huge stack of documents they need to sort through, how big is that stack? One of the fun things about working on Overview is we get firsthand experience with many of these stories, and get to talk a lot of nerdy shop about the others.

So here’s our casual list of document sets that journalists have had to contend with. I’ve thrown in the links where possible, and a description of how the documents were delivered.

  • U.K. MP expenses – 700,000 documents in 5,500 PDF files, from government website
  • Wikileaks Iraq war data – 391,832 structured records, each including a text description.
  • Wikileaks diplomatic cables – 251,287 cables, each a few pages long
  • Military discharge records – 112,000 assorted files in just about every document file format
  • NSA files leaked by Snowden – 50,000 to 200,000 according to the NSA
  • Wikileaks Afghanistan war data – 91,731 structured records, same format as Iraq data
  • Free the Files – 43,200 political TV ad spending files, PDF scans of paper, from FCC website
  • Paul Ryan correspondence – 9000 pages, on paper, via FOIA request of more than 200 agencies
  • Tulsa PD emails – 8000 emails, in Outlook format, via FOIA request
  • Pentagon Papers – 7000 pages leaked 1971, on paper, now declassified and available
  • Illegal adoption market in U.S. – 5029 messages web-scraped from a Yahoo! forum
  • Iraq security contractor reports – 4,500 pages on paper, via FOIA request
  • North Carolina foodstamp woes – 4,500 pages of emails, on paper, via FOIA request
  • New York State proposed legislation – 1,680 bills, downloaded via government API
  • White House Gulf of Mexico drilling emails – 628 documents, mostly emails, on paper, via FOIA request
  • Dollars for Docs – 65 gigantic disclosure reports, mostly huge PDFs of tables

So how big is the typical document dump? Well, it’s, ah… how do you measure that? How does a “record” compare to a “page” or an “email”? The first thing I see when I look at this list is the huge variety of file formats, largely because we’ve had to spend so much time helping people with a huge variety of file formats (more on that).

And it’s not just digital file formats. There’s actual paper involved in this business. Paper is a very popular choice for answers to FOIA requests. This is partially a technology problem and partially just that paper is still really good at certain things, like making absolutely 100% sure your redactions cannot be undone and the document metadata has been stripped. And even when you do get an electronic format, you might end up with a single massive PDF with thousands of variable-length emails (in which case do this to load it into Overview.)

But we do have some numbers here, and maybe a page and an email might be about the same-ish amount of work to deal with, so let’s imagine it’s all comparable and call it all pages. Some sets are very large, up to 700,000, but most are in the 5000-10000 page range. I’ll take a median instead of an average since the distribution is highly skewed, and… 9000.

The most typical size of document dump that journalists have to deal with is 9000 pages. At least, most typical of our collection. Half our cases are larger than that and half are smaller. The largest document sets that journalists work with are in the million range now, and we should expect that to increase. (See also: how to configure Overview to handle more documents.)

9000 pages would take 150 hours to read through completely at a rate of one per minute, or about 20 eight-hour workdays. The largest document set on this list (the MP expenses) would take almost exactly four years to read if you worked every single day. This is why we need computers.
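
The arithmetic behind those figures is easy to check. This sketch uses the reading rate and workday length from the paragraph above, plus a 365-day working year for the "every single day" case; the function and variable names are just for illustration.

```python
MINUTES_PER_PAGE = 1    # reading rate assumed in the text
HOURS_PER_WORKDAY = 8   # workday length assumed in the text


def reading_time(pages):
    """Return (hours, eight-hour workdays) needed to read the given number of pages."""
    hours = pages * MINUTES_PER_PAGE / 60
    workdays = hours / HOURS_PER_WORKDAY
    return hours, workdays


median_hours, median_days = reading_time(9_000)
largest_hours, largest_days = reading_time(700_000)
largest_years = largest_days / 365  # working every single day

print(median_hours, median_days)
print(round(largest_years, 2))
```

The median set comes out at 150 hours, or 18.75 workdays (the "about 20" above), and the MP expenses at just under four years.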

Keyboard shortcuts in Overview

It’s a little known fact that Overview has several keyboard shortcuts to make navigating through your documents even faster:

  • j, k — view next and previous document in the list.
  • arrow keys — navigate through tree. Selects parent, child, and sibling folders.
  • u — go back to document list, when viewing a single document.

Both of these sets of keys are essential for rapid review. You can select a folder, press j to read the first document (which automatically switches from the document list to the single document view) and then press right arrow to go to the next folder in the tree.

Algorithms are not enough: lessons bringing computer science to journalism

There are some amazing algorithms coming out of the computer science community which promise to revolutionize how journalists deal with large quantities of information. But building a tool that journalists can use to get stories done takes a lot more than algorithms. Closing this gap has been one of the most challenging and rewarding aspects of building Overview, and I really think we’ve learned something.

Overview is an open-source tool to help journalists sort through vast troves of documents obtained through open government programs, leaks, and Freedom of Information requests. Such document sets can include hundreds of thousands of pages, but you can’t find what you don’t know to search for. To solve this problem, Overview applies natural language processing algorithms to automatically sort documents according to topic and produce an explorable visualization of the complete contents of a document set.

I want to get into the process of going from algorithm to application here, because — somewhat to my surprise — I don’t think this process is widely understood.  The computer science research community is going full speed ahead developing exciting new algorithms, but seems a bit disconnected from what it takes to get their work used. This is doubly disappointing, because understanding the needs of users often shows that you need a different algorithm.

The development of Overview is a story about text analysis algorithms applied to journalism, but the principles might apply to any sort of data analysis system. One definition says data science is the intersection of computer science, statistics, and subject matter expertise. This post is about connecting computer science with subject matter expertise.

The algorithmic honeymoon

In October 2010 I was working at the Associated Press on the recently released Iraq War Logs. AP reporters toiled for weeks with a search engine to find stories within these 391,832 documents. It was painful, and there had to be a better way.

Rather than deciding what to look for I wanted the computer to read the documents and tell me what was interesting. I had a hunch that classic techniques from information retrieval (TF-IDF and cosine similarity) might work here, so I hacked together a proof-of-concept visualization of one month of the Iraq War Logs data using Ruby and Gephi.

And it worked! By grouping similar documents together and coloring them by incident type we were able to see the broad structure of the war. It was immediately clear that most of the violence was between civilians, and we found clusters of events around tanker truck explosions, kidnappings, and specific battles.
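
For readers who haven’t met the classic techniques mentioned above, here is a minimal sketch of TF-IDF weighting and cosine similarity in plain Python. It is not the original Ruby proof-of-concept or Overview’s implementation, just the textbook idea: words that appear in few documents get more weight, and two documents are similar when their weighted word vectors point in the same direction. The three "reports" are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, weighted by tf * log(N / df)."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in doc.items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up one-line reports, for illustration only.
reports = [
    "tanker truck explosion on highway",
    "explosion destroys tanker truck",
    "kidnapping reported near market",
]

vectors = tfidf_vectors(reports)
sim_01 = cosine(vectors[0], vectors[1])  # two reports about tanker trucks
sim_02 = cosine(vectors[0], vectors[2])  # unrelated reports
print(sim_01, sim_02)
```

Grouping documents then amounts to putting together the ones with high similarity, which is why the two tanker truck reports would land near each other and the kidnapping report would not.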

A few months later we had a primitive interactive visualization. It was exciting to see the huge potential of text analysis in journalism! This was our algorithmic honeymoon, when the problems were clear and purely technical, and we took big steps with rapid iterations.

But that demo was all smoke and mirrors. It was the result of weeks of hacking at file formats and text processing and gluing systems together and there was no chance anyone but myself could ever run it. It was research code, the place where most visualization and natural language processing technology goes to die. No one attempted to do a story with the system because it wasn’t mature enough to try.

Worse, it wasn’t even clear how you would do a story starting from one of these visualizations. Yes, we could see patterns in the data, but what did those patterns mean and how would we turn them into a story? In retrospect, this uncertainty should have told us that despite our progress in algorithms, we didn’t yet understand the journalism part of the problem.

Getting real work done

The next step was a prototype tool, initially developed by Stephen Ingram at UBC and completed by the end of 2011. This version introduced the topic tree and its folders for the first time. And I had a document set: 4,500 pages of recently declassified reports concerning private security contractors in Iraq. Trying to do a story about these documents taught us a lot about the difference between an algorithm and an application.

The moment I began working with these documents, as a reporter rather than a programmer, I discovered that it was stupendously important to have a smooth integrated document viewer. In retrospect it seems obvious that you’ll need to read a lot of documents while doing document mining, but it was easy to forget that sort of thing in the midst of talking about document vectors and topic modeling and fancy visualizations. I also found that I needed labels to show the size of each cluster, got frustrated at the overly complex tagging model, and implemented more intuitive panning and zooming in the scatterplot window. A few weeks of hacking eventually got me to a system I could use for reporting.

The final story included the results of this document set analysis, reporting from other document sources, and an interview with a State Department official. This was the first time we were able to articulate a reporting methodology: I explored the topic tree from left to right, investigating each cluster and tagging to mark what I’d learned, then followed up with other sources. The aim of the reporting process was to summarize and contextualize the content of a large number of documents. This was a huge step forward for Overview, because it connected the very abstract idea of “patterns in the data” to a finished story. We thought all document set reporting problems would take this form. We were wrong.

Just as the proof-of-concept was research code, the prototype was the kind of code you write on deadline to get the story done by any means necessary. The data journalism community often writes and releases code written for a single story. This is valuable because it demonstrates a technique and might even provide building blocks for future stories, but it’s usually not a finished tool that other people can easily use.

We learned this lesson vividly when we tried to get people to install the Overview prototype. Git clone, run a couple of shell scripts to load your documents, how hard could it be? It turned out to be very hard. At NICAR 2012 I led a room full of people through installing Overview and loading up a sample file. We had every type of problem: incompatible versions of git, Ruby, and Java; operating system differences; and lots of people who had never used a command line before. Of 20 people who tried, only 3 got the system working. We were beginning to make contact with our user community.

Usability trumps algorithm

We re-wrote Overview as a web application to solve our installation woes (largely the work of Jonas Karlsson and Adam Hooper). We also dropped the scatterplot visualization, the visualization that we had started with, because log data and user interviews showed no one was using it. We went all-in on the tree and had a deployed system by the end of 2012.

Do you understand what is happening in this screenshot? Is it clear to you that the window on the lower left is a list of documents, each represented by a line of extracted keywords? It wasn’t obvious to our users either, and no one used this new system for many months.

We knew that Overview was useful, because we and others had done stories with the prototype. But we were now expecting new people to come in fresh and learn to use the system without our help. That wasn’t happening. So we did think-aloud usability testing to find out why. We immediately discovered a number of serious usability problems. People were having a hard time getting their documents into Overview. They didn’t understand the document list interface. They didn’t understand the tree.

We spent months overhauling the UI. We hired a designer and completely rebuilt the document list. And based on user feedback, we changed the clustering algorithm.

During the prototype phase we had developed a high-performance document clustering algorithm based on preferentially sampling the edges between highly similar documents and building connected components, documented in this technical report. We were very proud of it. But it tended to produce long lists of very small clusters, meaning that each folder in the tree could have hundreds of sub-folders. This was a terrible way to navigate through a tree.
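To see why this style of clustering can fragment a document set, consider a simplified sketch in Python. It is not the algorithm from the technical report: it skips the edge sampling entirely and simply keeps every edge above an assumed similarity threshold, then finds connected components with union-find. The similarity values and the threshold are hypothetical.

```python
def connected_components(n, edges):
    """Group nodes 0..n-1 into components using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

# Hypothetical pairwise similarities between six documents.
similarities = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.85, (2, 3): 0.2, (4, 5): 0.1}
THRESHOLD = 0.5  # assumed cutoff for "highly similar"

strong_edges = [pair for pair, sim in similarities.items() if sim >= THRESHOLD]
clusters = connected_components(6, strong_edges)
print(clusters)  # [[0, 1, 2], [3, 4], [5]]
```

Any document without a strong edge becomes its own component, so a large, messy document set can easily yield many tiny clusters.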

So we replaced our fancy clustering with the classic k-means technique. We apply this recursively, splitting each folder into at most five sub-folders according to an adaptive algorithm. The resulting tree is not as faithful to the structure of the data as our original clustering algorithm, but that doesn’t matter. Overview’s visualization is for humans, not machines. The point is not to have a hyper-accurate representation of the data, but a representation that users are able to interpret and trust. For this reason, it was absolutely necessary to be able to explain how Overview’s clustering algorithm works to a non-technical audience.
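The recursive splitting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Overview's code: the adaptive choice of the number of sub-folders is not described here, so the sketch always asks k-means for up to five clusters, uses a deterministic initialization, and stops at an assumed minimum folder size. The names kmeans, build_tree, and min_size are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns the list of non-empty clusters."""
    # Deterministic initialization from evenly spaced points (an assumption).
    step = len(points) / k
    centers = [points[int(i * step)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return [cl for cl in clusters if cl]

def build_tree(points, max_children=5, min_size=3):
    """Recursively split a folder into at most max_children sub-folders."""
    node = {"size": len(points), "children": []}
    if len(points) <= min_size:
        return node
    parts = kmeans(points, min(max_children, len(points)))
    if len(parts) > 1:  # stop when k-means cannot split the folder any further
        node["children"] = [build_tree(p, max_children, min_size) for p in parts]
    return node

# Toy two-dimensional "document vectors"; real vectors would be TF-IDF weights.
points = [(float(x), float(y)) for x in range(0, 30, 3) for y in (0, 10)]
tree = build_tree(points)
print(tree["size"], len(tree["children"]))
```

Capping the fan-out at five keeps every level of the tree small enough to scan at a glance, which is the navigation property the earlier algorithm lacked.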

What do journalists actually do with documents?

We solved our usability problems by the summer of 2013 and journalists began to use our system; we’ve had a great crop of completed stories in the last six months. And as we gained experience we finally began to understand what journalists actually do with a set of documents. We have seen four broad types of document-driven stories, and only one of them is the “summarize these documents” task we originally thought we wanted to support. In other cases the journalist is looking for something specific, or needs to classify and tag every document, or is looking to separate the junk from the gold.

Today we have a solid connection to our users and their problems. Our users are generally not full-time data journalists and have typically never seen a command line. They get documents in every conceivable format, from paper to PDF. Even when the material is provided in electronic form it may need OCR, or the files may need to be split into their original documents. Our users are on deadline and therefore impatient: Overview’s import must be extremely quick or reporters will give up and start reading their documents manually. And each journalist might only use Overview once a year when a document-driven story comes their way, which means the software cannot require any special training.

We learned what journalists actually wanted to do, and we implemented features to do it. We implemented fuzzy search to help find things in error-prone OCR’d material. We added an easy way to show the documents that don’t yet have tags for those projects where you really do need to read every page. And Overview now supports multiple languages and lets you customize the clustering. We are still working on handling a wide range of import formats and scenarios including integrated OCR.
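To illustrate why fuzzy matching matters for OCR'd text, here is a small Python sketch using the standard library's difflib. It is only an illustration of the general idea, not how Overview's search is implemented; the function name fuzzy_find, the 0.8 cutoff, and the sample sentence are assumptions.

```python
import difflib

def fuzzy_find(query, text, cutoff=0.8):
    """Return the words in text that approximately match query, best first."""
    words = text.lower().split()
    # get_close_matches requires n > 0, so guard against empty text.
    return difflib.get_close_matches(
        query.lower(), words, n=max(1, len(words)), cutoff=cutoff
    )

# Hypothetical OCR output in which "contract" was misread as "contrnct".
ocr_text = "The contractor signed the contrnct with the ministry"
matches = fuzzy_find("contract", ocr_text)
print(matches)
```

An exact search for "contract" would miss the misread word entirely, while the fuzzy version still surfaces it along with near variants such as "contractor".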

This is what the UI looks like today.

Algorithms are not enough

Overview began when we saw that text analysis algorithms could be applied to journalism. We originally envisioned a system for stringing together algorithmic building blocks, a concept we called a visualization sketching system. That idea was totally wrong, because it was completely disconnected from real users and real work. It was a technologist’s fantasy.

Unfortunately, it appears that much of the natural language processing, machine learning, and visualization community is stuck in a world without people. The connection between the latest topic modeling paper and the needs of potential users is weak at best. Such algorithms are evaluated by abstract statistical scores such as entropy or precision-recall curves, which do not capture critical features such as whether the output makes any sense to users. Even when such topic models are built into exploratory visualization systems (like this and this) the research is typically disconnected from any domain problem. While it seems very attractive to build a general system, this approach risks ignoring all real applications and users. (And the test data is almost always academic papers or news archives, both of which are unrealistically clean.) We are seeing ever more sophisticated techniques but very little work that asks what makes one approach “better” than another, which is of course highly dependent on the application domain.

There is a growing body of work that recognizes this. There is work on designing interpretable text visualizations, research which compares document similarity algorithms to human ratings, and evolving metrics for topic quality that have been validated by user testing. See also our discussion of topic models and XKCD. And we are beginning to see advanced visualization systems evaluated with real users on real work, like this.

It’s also important to remember that manual methods are valuable. Reporters will spend days reading and annotating thousands of pages because they are in the business of getting stories done. Machine learning might help with categorize-and-count tasks, but the computer is going to make errors that may compromise the accuracy of the story, so the journalist must review the output anyway. The best system would start with a seamless UI for manual review and tagging, then add a machine learning boost. This is exactly what we are pursuing for Overview.

Our recommendation to technologists of all stripes is this: get out more. Don’t design for everyone but for specific users. Make the move from algorithmic research to the anything-goes world of getting the work done. Optimize your algorithms against user needs rather than abstract metrics; those needs will be squishy and hard to measure, but they’ll lead you to reality. Test with real data, not clean data. Finish your users’ projects, not your projects. Only then will you understand enough to know what algorithms you need, and how to build them into a killer app.