Step-by-step instructions for using Overview

To get started using Overview, you can watch this video or follow the steps below.

 

1. Get your documents into Overview.

  • If all of your documents are in PDF form you can simply upload them directly. Note that documents scanned from paper must first be OCR’d to turn their images into searchable text.
  • If you are a journalist, you can upload your documents to DocumentCloud, a free tool to upload, OCR, search, store, and publish documents in many formats.
  • You can also upload your documents as a CSV file, a type of spreadsheet file that you can save from Excel, export from a database system, or create manually in a text editor. Here’s lots more on loading documents via CSV.

For more, see the complete guide to getting your documents into Overview.
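If your documents are plain text files, a short script can build the CSV for you. The following is a minimal sketch rather than an official tool: the function name `folder_to_csv` and the column names `id`, `title`, and `text` are assumptions made for illustration, so check the CSV guide for the exact header Overview expects.

```python
import csv
from pathlib import Path


def folder_to_csv(folder, csv_path):
    """Write one CSV row per .txt file found in `folder`; return the row count."""
    files = sorted(Path(folder).glob("*.txt"))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        # Assumed header: confirm the column names against the CSV guide.
        writer.writerow(["id", "title", "text"])
        for number, path in enumerate(files, start=1):
            # Replace undecodable bytes instead of failing on one bad file.
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([number, path.stem, text])
    return len(files)
```

Calling `folder_to_csv("my_documents", "documents.csv")` would produce a file with one row per document. The `csv` module quotes commas and line breaks inside the text, so each document stays in a single cell.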

A useful trick for uploading many documents at once: when the file dialog box opens, you can select all of the documents in a folder by clicking on the first file, then shift-clicking on the last file (or by pressing Control-A on Windows, or Command-A on Mac).

Overview keeps all uploaded documents private, unless you share them explicitly.

 

2. Explore the documents in the tree view

Overview’s main screen is divided into four parts: the folder tree, search field, tag list, and document viewer.

You can navigate through the folders in the tree with the arrow keys, or by clicking. Each folder is labelled by the keywords that best describe the documents filed under that folder. The label also tells you if MOST, SOME, or ALL of the documents in that folder contain each keyword. A folder’s sub-folders contain, collectively, all of the documents in the parent folder, broken down into increasingly narrow topics.
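One way to picture the MOST, SOME, and ALL labels is as the share of documents in a folder that contain a keyword. The sketch below only illustrates that idea: the function name `keyword_coverage` and the one-half cutoff between SOME and MOST are assumptions, not Overview’s actual rule.

```python
def keyword_coverage(keyword, documents):
    """Return 'ALL', 'MOST', 'SOME', or 'NONE' for one keyword in one folder."""
    if not documents:
        raise ValueError("folder is empty")
    keyword = keyword.lower()
    # Count the documents that contain the keyword as a whole word.
    hits = sum(1 for doc in documents if keyword in doc.lower().split())
    if hits == 0:
        return "NONE"
    if hits == len(documents):
        return "ALL"
    # Illustrative assumption: more than half of the folder counts as MOST.
    if hits / len(documents) > 0.5:
        return "MOST"
    return "SOME"
```

For a folder holding “the budget report”, “the budget memo”, and “the travel memo”, this sketch reports ALL for “the”, MOST for “budget”, and SOME for “travel”.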

The document viewer shows either a particular document or a list of selected documents. Each document in the list is summarized by a list of keywords specific to that document.

If you know what you’re looking for, enter your query in the “search” box and Overview will show you where documents containing that term appear in the tree.

The tree automatically expands and zooms to follow your selections. Or you can pan it by dragging with the mouse, and zoom using the +/- buttons or the mouse wheel. Folders marked with ⊕ can be expanded to show sub-folders, while ⊖ hides sub-folders.

 

3. Tag interesting documents
As you explore the folder tree, you’ll run across individual documents or entire folders you want to remember. Enter a descriptive tag in the “new tag” field and press “tag.” If you’re currently viewing a specific document, Overview will tag just that document. If instead you’re viewing the list of documents in a folder, Overview will tag the entire folder.

Here’s the complete guide to tagging, including how to export tags to create different types of visualizations.

Tags and folders have independent lives: each document can have any number of tags applied to it, and the same tag can be applied anywhere in the tree.

Once you’ve created a tag, you can add that tag to the current document or document list at any time by pressing the + button that appears when your mouse is over the tag name. Or press – to remove the tag.

Clicking on a tag name selects that tag, highlighting the tagged documents in the tree and loading them into the document list.
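To see why tags and folders are independent, it can help to think of each tag as nothing more than a set of document ids, stored separately from the tree. The class below is a hypothetical model written for this guide, not Overview’s internal data structure; the name `TagStore` and its methods are invented for illustration.

```python
from collections import defaultdict


class TagStore:
    """Tags kept apart from the folder tree: each tag maps to a set of document ids."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def add(self, tag, doc_ids):
        """Apply a tag to one document, or to every document in a folder."""
        self._docs_by_tag[tag].update(doc_ids)

    def remove(self, tag, doc_ids):
        """Take a tag off the given documents; other tags are untouched."""
        self._docs_by_tag[tag].difference_update(doc_ids)

    def documents(self, tag):
        """Selecting a tag: list every document that carries it."""
        return sorted(self._docs_by_tag.get(tag, set()))

    def tags_for(self, doc_id):
        """List every tag applied to one document."""
        return sorted(t for t, docs in self._docs_by_tag.items() if doc_id in docs)
```

Because a tag is only a set of ids, the same tag can cover a whole folder plus a single document from elsewhere in the tree, and one document can appear under any number of tags.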

 

4. Work your way through the tree
When you have a lot of documents, it pays to be systematic. We recommend working your way through the folders in the tree from left to right — biggest folders to smallest folders. Select a folder then view a few of the documents in it to see if you understand what they have in common. If specific words appear in MOST or ALL documents in a folder, that’s a sign that the folder contains a single meaningful topic. Otherwise there may be more than one important topic in the documents in that folder, so try opening child folders instead until you find a folder where all of the documents are similar. Then tag that folder with a descriptive label.

Use search to find specific documents of interest, but pay attention to which folders contain those documents. You may find other relevant documents in the same folder, even if they don’t contain your search term.

As you proceed, you may find documents that talk about similar topics in different folders. Overview doesn’t know what you want out of your documents, so it can’t always guess how they should be arranged. You can apply a tag to any combination of folders and documents to create a set that is meaningful to you.

You may also discover that the documents in a folder are irrelevant to your work, in which case you can tag them with “read” and simply move on. Part of the power of Overview is being able to decide not to look at an entire folder.

When you’ve finished this process, you’ll have a neatly categorized tree, and a set of tags corresponding to all the interesting topics in your documents.

 

5. Learn more!

Overview has many more powerful features: you can automatically split long documents into individual pages, ignore meaningless words, compare data to text, and many other things. See the help for more tips and tricks, or contact us to ask about your specific needs!

 

We're hiring a front-end developer

The Overview Project, an open-source document mining system, is looking for a front-end developer.

Journalists are increasingly confronted with huge sets of documents that they have to understand quickly. These documents come from Freedom of Information requests, leaks, or open government sites, and consist of thousands or even millions of pages of disorganized documents in any file format. Overview is an open-source tool to help investigative journalists and other curious people find the essential information in a huge document dump.

The software analyzes the full text of each document using natural language processing techniques, automatically sorts documents into topics and sub-topics, and visualizes their content. It has been used to report on emails, declassified archives, tweets, and more.  Overview includes full text search, but unlike a search engine it is designed to help you find what you don’t even know you’re looking for.

We need an additional front-end developer on the team. Overview is written in Scala on the Play framework, with a CoffeeScript front end. We’re looking for:

  • Solid JavaScript engineering experience, with modern tools such as jQuery, CoffeeScript, and Backbone
  • Experience with a modern MVC web app architecture, such as Rails, Django, or Play
  • It’s open source! Are you good at supporting a developer community?
  • Design and usability sense; you’ll be making many decisions at the intersection of beauty and function.
  • An understanding of web application architecture. Stuff like AWS and Postgres.
  • Bonus geek points: Experience in visualization, natural language processing, or distributed systems

You’ll be on a small team using agile processes, which means you’ll have a great deal of influence over the product and its architecture. Perks include travel to data journalism conferences and flexible working arrangements. New York City area preferred, but will consider remote. Mostly, we’re looking for someone who cares about making it easier for investigative journalists to do their job. Open data is great, but transparency means nothing if no one is watching.

Overview is an open-source project of the Associated Press, funded by a News Challenge grant from the Knight Foundation.

Please send resumes to jonathan@overviewproject.org

How to process documents that contain more than one language

Overview supports several different languages, but you can only pick one language per document set. Fortunately, there is an easy workaround to analyze a document set that contains several different languages.

The trick is to paste a stop words list into the “words to ignore” box. Stop words are the short, common, grammatical words in a language such as “a” and “for” in English, or “un” and “soy” in Spanish. Overview automatically ignores the stop words from whatever language you tell it to use. This is necessary; otherwise you would always get a folder labelled “MOST: the” when processing English documents. Overview only removes stop words from one language at a time, but you can get exactly the same effect by pasting in stop words for other languages.

This trick can also be used to process documents in a language that Overview doesn’t officially support yet!

Suppose you have a document set containing English and French text. You can tell Overview that the documents are in English, then paste a French stop words list into the “words to ignore” box. Separate the words with spaces or put them on different lines. For example, the box might begin: le la les un une des et

You can find stop words lists for many languages here. Simply cut and paste the words for the languages your documents include, as many different languages as you want. (There is no need to paste in stop words for the language you have told Overview to use, as the system adds those stop words automatically.)
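If you are combining lists for several languages, a small script can merge them and drop duplicates before you paste. This is only a sketch: the function name `words_to_ignore` is made up for this example, and it assumes each list is plain text with words separated by spaces or line breaks.

```python
def words_to_ignore(*stop_word_lists):
    """Merge several stop-word lists into one space-separated string."""
    seen = set()
    merged = []
    for words in stop_word_lists:
        # split() with no arguments handles spaces, tabs, and line breaks alike.
        for word in words.split():
            word = word.lower()
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return " ".join(merged)
```

For instance, `words_to_ignore("le la les\nun une", "der die das un")` keeps a single copy of “un”, and the resulting string can be pasted straight into the “words to ignore” box.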

This isn’t a perfect technique, because some stop words in one language can be legitimate words in another language, but it will get you 95% of the way there. Most importantly, it will allow you to use Overview on multi-language documents right now, before we develop a more integrated solution. As noted above, you can also process documents in any language, not just the ones Overview supports.

 

PDF upload: the easiest way yet to get your documents into Overview

Your big pile of documents might arrive in many different forms, from a stack of paper to an archive of random files. But PDFs are a popular document file format that every reporter, researcher or analyst has to work with sooner or later. Now you can upload them directly into Overview.

Just choose “Upload PDF files” from the “Import Documents” menu, then “Add Files” to open a file selection box. You’ll want to upload more than one file of course, so you can select all files in a directory by pressing Control-A (Windows) or ⌘-A (Mac). Or you can select multiple specific files using the keyboard and mouse in the usual way (if you’re not familiar with how to do that, here are instructions for Windows and Mac).

You can press “Add Files” as many times as you like. Overview will begin uploading files as soon as they are added, and then proceed to clustering and visualization when you press “Done Adding Files” to set the import options (such as language and words to ignore).

Overview treats each file as a “document,” so if you have one long PDF with many documents within it, you will want to split it first. There are several free tools to do this, both web-based and command line. We will soon add the ability to split documents into pages automatically, as is already possible when importing from DocumentCloud. Also, Overview does not (yet!) do OCR, which is the process of making a scanned image searchable. If you can search your PDFs or cut and paste text from them, they do not need OCR and Overview will be able to handle them. Otherwise, you can use commercial products such as ABBYY FineReader and OmniPage to do the OCR on your own computer before uploading.

Other ways to get your documents into Overview

You can still import documents in several other ways:

  • You can import a DocumentCloud project
  • You can upload a CSV file. CSV files are a general data transfer format that most applications can write, including SQL databases, Microsoft Excel, and social media data sources like DataSift or Radian6.
  • We’ve also written a powerful script which will scan a folder for documents in many different formats, automatically OCR if needed, and produce a CSV file for Overview. Check out docs2csv if this fits your problem.
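For the SQL database case in the list above, exporting a query result to CSV takes only a few lines. The sketch below uses SQLite because it ships with Python. The function name `export_query_to_csv`, the example table, and the `id`, `title`, and `text` column names are all assumptions for illustration, so confirm the expected header against the CSV guide.

```python
import csv
import sqlite3


def export_query_to_csv(db_path, query, csv_path):
    """Run `query` and write its rows, with a header row, to `csv_path`."""
    connection = sqlite3.connect(db_path)
    count = 0
    try:
        cursor = connection.execute(query)
        # Column names (or their AS aliases) become the CSV header.
        header = [column[0] for column in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            for row in cursor:
                writer.writerow(row)
                count += 1
    finally:
        connection.close()
    return count
```

With a hypothetical `messages` table, a query such as `SELECT id, subject AS title, body AS text FROM messages` renames the columns on the way out, so the database schema does not have to match the CSV header.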