Five things you didn’t know Overview could do

1. Find names of places and companies

The Entities plugin will automatically find company names, place names (in multiple languages!), numbers, or just unusual words that aren’t in the dictionary. Like all plugins, it’s available under Add View.

Overview’s entity detection algorithms are designed to err on the side of including things that aren’t entities, rather than missing things which are — unlike normal NLP techniques which often miss 50% of entities. You can hit the little red X’s to remove junk from the list.
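To make the "include too much rather than too little" idea concrete, here is a toy sketch. It is not Overview's actual algorithm: the tiny `KNOWN_WORDS` set stands in for a real dictionary, and `candidate_entities` is a hypothetical name. It flags capitalised words, numbers, and anything missing from the dictionary, so some junk gets through by design.

```python
import re

# Assumption: a tiny stand-in for a real dictionary of common words.
KNOWN_WORDS = {"the", "a", "in", "of", "met", "with", "officials", "paid", "and"}

def candidate_entities(text):
    """Return tokens that might be entities, erring on the side of inclusion."""
    found = []
    for token in re.findall(r"[A-Za-z0-9']+", text):
        is_number = token.isdigit()
        is_capitalised = token[0].isupper()
        is_unknown = token.lower() not in KNOWN_WORDS
        if (is_number or is_capitalised or is_unknown) and token not in found:
            found.append(token)
    return found

entities = candidate_entities("The officials met with Acme in Zagreb and paid 5000 zlotys")
print(entities)  # ['The', 'Acme', 'Zagreb', '5000', 'zlotys']
```

Note that "The" sneaks in because it is capitalised. That is exactly the kind of junk the red X is for.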

2. Make scanned PDFs searchable (OCR)

Overview will automatically OCR any PDF which doesn’t seem to have any text in it, such as scanned pages, using the open-source Tesseract engine. Scanned documents will be much slower to load — but you won’t be able to search them until you OCR them somewhere, so why not let Overview do it?

If you’d like to get OCR’d files out of Overview, you can simply export the documents after Overview has loaded them. You’ll get searchable PDFs back.

3. Customize the Word Cloud

You can use the delete tool to remove the words that aren’t adding anything. When you remove words, less common words are added to fill up the space. This way you can zero in on exactly what you want to investigate.

You can have more than one word cloud at a time, through the Add View menu. Press the Hidden Words button to unhide words.
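The "remove a word and a less common one takes its place" behaviour can be sketched in a few lines. This is only an illustration under simple assumptions (whitespace tokenising, a fixed cloud size); `word_cloud` and its parameters are hypothetical names, not part of Overview.

```python
from collections import Counter

def word_cloud(texts, hidden=(), size=3):
    """Return the `size` most common words, skipping any hidden words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    hidden = set(hidden)
    visible = [(word, n) for word, n in counts.most_common() if word not in hidden]
    return visible[:size]

docs = [
    "the pizza budget memo",
    "the pizza budget",
    "the pizza budget",
    "the pizza memo",
    "the audit",
]

print(word_cloud(docs))                  # [('the', 5), ('pizza', 4), ('budget', 3)]
print(word_cloud(docs, hidden={"the"}))  # [('pizza', 4), ('budget', 3), ('memo', 2)]
```

Hiding "the" frees a slot, and "memo" moves up to fill it.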

4. The all-powerful Export

You can export all documents or just the result of the current search. For example, you could download only documents with the word “pizza” in them. And you can export either one document per file, or just the text (and any custom fields) as a CSV.

This means you can use Overview as a text extractor: upload random files, download a clean spreadsheet of the text. Or an OCR machine: upload random files, get searchable PDFs back.

5. Add custom data to each document

Overview now supports custom fields, or as we like to call them, document metadata.

You can add a field and set the value for all documents in a batch import.

Or you can edit the fields on one document at a time in the document viewer. If you add a field to one document, it will appear (initially blank) for all documents.

Or, if you load your documents via CSV, Overview will read in each extra column as a field.

Each field will be its own column when you export as a spreadsheet.
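If you prepare your CSV with a script, the extra columns are just more keys in each row. The sketch below assumes the document text lives in a column named `text`; the `source` and `received` columns are made-up examples of custom fields.

```python
import csv
import io

# Assumption: "text" holds the document body; the other columns are invented fields.
rows = [
    {"text": "Invoice for 40 pizzas.", "source": "FOIA request", "received": "2015-03-02"},
    {"text": "Memo about the budget.", "source": "Leaked", "received": "2015-04-17"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "source", "received"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Write `csv_text` to a file and upload it: each extra column should then show up as a field.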

Add fields during import

Now you can add custom fields to all documents at once while you’re importing. You can use this to make notes about a batch of documents, such as tracking the source of each document in your set.

I’ll walk you through it.

  1. First, add files as usual.
  2. Use the (new) “Fields” interface to specify fields for these documents.
  3. Now, every document you uploaded has the field values you wrote.
  4. You can specify other field values whenever you add more documents to the document set.
  5. The original documents’ fields will have the original values. The new documents’ fields will have the new values.
  6. Export the document set, and you’ll see the field values for all documents.

Fields — or document metadata — are a great feature, and hopefully this makes them a little more useful.

Overview now does OCR!

We’ve added a new feature into Overview: Optical Character Recognition (OCR). That means you can upload scanned PDFs and Overview will automatically read the text from them.

Overview decides when to use OCR automatically, on every page that has fewer than 100 characters of searchable text. This will make your uploads a lot slower, but you will need to OCR them anyway before you can search them, and you can’t beat the convenience.

Overview uses Tesseract for OCR, because Tesseract is free. Sometimes Tesseract produces more garbage characters than other OCR engines, such as the one included in Adobe Acrobat Pro. If you’ve already OCR’d your documents using another program, Overview will just read the previously created text.
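The "fewer than 100 characters" rule is easy to picture in code. This is a sketch of the rule as described above, not Overview's source: `needs_ocr` is a hypothetical name, and ignoring leading and trailing whitespace is my assumption.

```python
OCR_THRESHOLD = 100  # from the post: pages with fewer characters than this get OCR

def needs_ocr(page_text, threshold=OCR_THRESHOLD):
    """Decide per page: too little searchable text means the page is probably a scan."""
    # Assumption: surrounding whitespace does not count as searchable text.
    return len(page_text.strip()) < threshold

pages = [
    "",                             # a pure scan: no text layer at all
    "Scanned by office copier",     # a scan with a short stamp
    "A page of real text. " * 10,   # a normal page (about 200 characters)
]
decisions = [needs_ocr(page) for page in pages]
print(decisions)  # [True, True, False]
```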


How Overview handles pesky Microsoft Excel

Overview has long had a fairly important (if little-used) feature: it can export all documents into a spreadsheet. Day one, we wrote the spreadsheet in “comma-separated values” (“CSV”) format.

Then we realized that Microsoft Excel couldn’t open all CSV files.

Day two, we implemented a separate export file type, just for Microsoft Excel. Here are the differences.

Import, edit, and create document metadata

Have you ever needed to extract the author or write notes for each document? Now you can, with fields.

The new “Fields” section sits in wait underneath each document. Click it, and you’ll be able to create fields and change their values.

The list of fields is the same for every document in a document set. Each document has its own field values.
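Here is one way to picture that model. It is a toy sketch, not how Overview stores anything: `DocumentSet`, `add_field` and `set_value` are hypothetical names.

```python
class DocumentSet:
    """Toy model: one shared list of field names, separate values per document."""

    def __init__(self, n_documents):
        self.fields = []
        self.values = [{} for _ in range(n_documents)]

    def add_field(self, name):
        # A new field appears on every document, initially blank.
        if name not in self.fields:
            self.fields.append(name)
            for doc_values in self.values:
                doc_values[name] = ""

    def set_value(self, doc_index, name, value):
        # Setting a value on one document creates the field for all of them.
        self.add_field(name)
        self.values[doc_index][name] = value

doc_set = DocumentSet(3)
doc_set.set_value(0, "author", "J. Smith")
print(doc_set.fields)
print(doc_set.values)
```

After the call, all three documents have an `author` field, but only the first has a value.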

You can create new fields directly in Overview, or import them as extra columns in your CSV (see: importing documents using a CSV file.)

All your fields will appear as new columns when you export a spreadsheet.

You can use a spreadsheet program to filter for field values.
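If you would rather script it than open a spreadsheet, the same filter is a few lines of Python. The data and column names below (`title`, `source`, `text`) are invented for illustration; your export will have whatever fields you created.

```python
import csv
import io

# Assumption: a made-up export. Real column names depend on your own fields.
exported = io.StringIO(
    "title,source,text\n"
    "memo-1.pdf,FOIA request,Budget memo\n"
    "memo-2.pdf,Leaked,Pizza invoice\n"
    "memo-3.pdf,FOIA request,Meeting notes\n"
)

foia_titles = [
    row["title"]
    for row in csv.DictReader(exported)
    if row["source"] == "FOIA request"
]
print(foia_titles)  # ['memo-1.pdf', 'memo-3.pdf']
```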

The “fields” feature is new. We know there’s plenty of pizzazz to add:

  • Right now, we only support single-line text fields: no dates, numbers, geo-coordinates, or so forth. As a workaround, format your text values carefully (e.g., use YYYY-MM-DD for dates) so your spreadsheet program can grok them.
  • Overview’s search feature doesn’t examine field values.
  • You cannot create fields or write to fields using the API. (You can read field values with the API, though: they’re in document.metadata.)
  • You can only set metadata on one document at a time.
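For the date workaround in the first bullet, here is a small sketch of converting dates to YYYY-MM-DD before typing or importing them. The `to_iso_date` helper and the MM/DD/YYYY input format are assumptions for the example.

```python
from datetime import datetime

def to_iso_date(value, input_format="%m/%d/%Y"):
    """Convert a date string to YYYY-MM-DD so it sorts and parses predictably."""
    return datetime.strptime(value, input_format).date().isoformat()

print(to_iso_date("07/13/2015"))  # 2015-07-13

# A handy side effect: ISO dates sort chronologically even as plain text.
dates = ["2015-07-13", "2014-12-01", "2015-01-30"]
print(sorted(dates))
```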

Hi! Now we’re

We’re changing to redirect to our new domain name,

Why the change? Two reasons:

  1. Overview isn’t just an experimental “project” any more. Overview is a go-to tool for doc-crunching.
  2. Overview isn’t a non-profit “.org”. Overview Services Inc. is a commercial company. Don’t get us wrong: we’d love a donation as much as anybody. But our consulting work keeps the lights on.

How does this affect you, our user? Well … uh … the text in your browser’s URL field will shrink by three characters. That’s about it.

Our automatic redirects will kick in today, Monday, July 13, 2015, around noon. Don’t worry: you won’t lose any of your work, even if you’re using Overview while we switch.

Update, July 13, 2015: all done.

Overview’s Search Syntax

Overview supports phrase searches, fuzzy searches, and booleans. Here’s what you can search for in the search box and the Multi-Search plugin:

  • John Smith: All documents containing the phrase “John Smith”. All the words, in order.
  • Pizza~: All documents matching the word “Pizza” or similar words such as “Piazza” or “Pizzas”. (“~” after a single word means fuzzy search. It can find documents that contain typos.)
  • John Smith AND Alice Smith: All documents containing both the phrase “John Smith” and the phrase “Alice Smith”. (“AND” means both phrases must appear.)
  • John Smith OR Alice Smith: All documents containing either the phrase “John Smith” or the phrase “Alice Smith”, or both. (“OR” means at least one of the phrases must appear.)
  • John Smith AND NOT Alice Smith: All documents containing the phrase “John Smith” and not the phrase “Alice Smith”. (“NOT” means the phrase must not appear.)
  • Alice AND NOT (Bob OR Carol): All documents containing the phrase “Alice” and neither the phrase “Bob” nor the phrase “Carol”. (Parentheses help organize complicated queries.)
  • "John and Alice Smith": All documents containing the phrase “John and Alice Smith”. (Without quotation marks, it would have been interpreted as “(John) AND (Alice Smith)”. Quotation marks tell Overview to ignore operators such as AND, OR and NOT. You can use quotation marks or apostrophes.)
  • John Smith~2: All documents matching the phrase “John Smith” or phrases with the words John and Smith at most 2 words apart, such as “John 'The Culprit' Smith”. (“~N” after a multi-word phrase means proximity search.)
  • Smith*: All documents containing a word that begins with “Smith”, such as “Smith”, “Smithy” or “Smithsonian”. (“*” after a phrase means prefix search.)
  • title:John Smith: All documents containing the phrase “John Smith” in their titles. The other way around is body:John Smith. By default, Overview searches every field.
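To pin down what phrase, prefix and proximity matching mean, here is a toy model of three of the rules in the list above. It is my own sketch, not Overview's search engine: the function names are hypothetical, and the proximity check only handles the words in their original order.

```python
import re

def tokens(text):
    """Lowercased words, with punctuation stripped."""
    return re.findall(r"\w+", text.lower())

def has_phrase(text, phrase):
    """True if the words of `phrase` appear in `text`, adjacent and in order."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def has_prefix(text, prefix):
    """True if any word starts with `prefix` (like Smith*)."""
    return any(word.startswith(prefix.lower()) for word in tokens(text))

def has_near(text, first, second, max_gap):
    """True if `second` follows `first` with at most `max_gap` words between them."""
    words = tokens(text)
    starts = [i for i, w in enumerate(words) if w == first.lower()]
    ends = [i for i, w in enumerate(words) if w == second.lower()]
    return any(0 < j - i <= max_gap + 1 for i in starts for j in ends)

doc = "John 'The Culprit' Smith ordered pizza near the Smithsonian"
print(has_phrase(doc, "John Smith"))        # False: the words are not adjacent
print(has_near(doc, "John", "Smith", 2))    # True: like John Smith~2
print(has_prefix(doc, "Smith"))             # True: like Smith*
# Booleans map onto Python's own operators, e.g. pizza AND NOT Alice Smith:
print(has_phrase(doc, "pizza") and not has_phrase(doc, "Alice Smith"))  # True
```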

In the coming months, we’ll be sitting with users to see how this new query language works for them. If you have any feedback about a particular query, please use the “Talk to Us” link at the top of Overview.

Powerful new plugins: Wordcloud and Multisearch

Starting today Overview has two new plugins pre-installed on You can access both through the Add View menu.

WordCloud is just that. You can click on a word to select all documents containing that word.

Multisearch looks for many search terms in all documents at once.

You can add items one by one, or click Edit entire list as text to paste in lots of search terms at once. After you’ve added a search, you can edit the query by clicking the Edit link to the right. In this example, the search named “environmental” actually searches for “environmental OR environment”. You can also do fuzzy searches by adding a ~ (tilde) to the end of the query, like “Obama~”, which can be incredibly useful for searching through documents that have been poorly scanned and OCR’d. Multisearch supports the same advanced query syntax as Overview’s regular search.
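Conceptually, Multisearch is a table of named searches with a document count next to each. The sketch below imitates that with plain word matching. It is an illustration only: `multisearch` is a hypothetical function, and each search is simplified to a list of alternative words joined by OR.

```python
import re

def multisearch(documents, searches):
    """Count the documents matching each named search (any listed word matches)."""
    doc_words = [set(re.findall(r"\w+", doc.lower())) for doc in documents]
    return {
        name: sum(any(term.lower() in words for term in terms) for words in doc_words)
        for name, terms in searches.items()
    }

documents = [
    "Environmental review of the site",
    "The environment committee met",
    "Budget memo",
]
searches = {
    "environmental": ["environmental", "environment"],  # environmental OR environment
    "budget": ["budget"],
}

counts = multisearch(documents, searches)
print(counts)  # {'environmental': 2, 'budget': 1}
```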

Both of these plugins were written using Overview’s new plugin development API, and you can write your own!

Overview can now read most file formats directly

Previously, Overview could only read PDF files. (You can also import all documents in a single CSV file, or import a project from DocumentCloud.)

Starting today, you can directly upload documents in a wide variety of file formats. Simply add the files — or entire folders — using the usual file upload page.

Note that the “Add all files in a folder” button is only available when you are using the Google Chrome browser, due to limitations in browser support for this feature.

Overview will automatically detect the file type and extract the text. Your document will be displayed as a PDF in your browser when you view it. Overview supports a wide variety of formats, including:

  • PDF
  • HTML
  • Microsoft Word (.doc and .docx)
  • Microsoft PowerPoint (.ppt and .pptx)
  • plain text, and also rich text (.rtf)

For a full list, see the file formats that LibreOffice can read.
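If you want to try a similar conversion on your own machine, LibreOffice can convert files to PDF from the command line in headless mode. Whether Overview runs LibreOffice exactly this way is an assumption on my part. The `conversion_command` helper below is hypothetical and only builds the command without running it. Treating plain text as `.txt` is also an assumption.

```python
from pathlib import Path

# File extensions for the formats named in the list above (".txt" is an assumption).
CONVERTIBLE = {".html", ".doc", ".docx", ".ppt", ".pptx", ".txt", ".rtf"}

def conversion_command(path, out_dir="converted"):
    """Return a LibreOffice command that converts `path` to PDF, or None for PDFs."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return None  # already a PDF, nothing to convert
    if suffix not in CONVERTIBLE:
        raise ValueError(f"unsupported file type: {suffix}")
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, str(path)]

print(conversion_command("report.pdf"))
print(conversion_command("Slides.PPTX"))
```

You could pass the resulting list to `subprocess.run` if LibreOffice is installed and `soffice` is on your path.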

Stupid Tag Tricks

Overview’s tags are very powerful, but it may not be obvious how to use them best. Here’s a collection of tagging tricks that have been helpful to our users, from Overview developer Jonas Karlsson.

Tracking documents to review

After you have reviewed a set of documents, tag them with “reviewed” in addition to any other tags you might want, such as “interesting” or “follow up.” Then you can instantly see what you have not reviewed by using the Show Untagged button.

If you are working together with other people to review documents, you can add a tag called “In review by XY” when you start reviewing a folder. When review is complete, add the “Reviewed by XY” tag, and remove the “In review” tag. If the documents being reviewed by different team members overlap, these tags will make it easier to avoid duplicate work.

Grouping Tags

Tags are sorted alphabetically. To group tags, start the tag name with the same letter or punctuation: “+ a”, “+ x”, “* b”, “* y”, “* z”
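You can preview how a prefix scheme will group before you rename anything. The sketch below uses Python's plain character-by-character sort, which is an assumption: Overview's exact ordering rules may differ for some characters.

```python
tags = ["+ x", "* z", "follow up", "+ a", "* b", "interesting", "* y"]

# Plain sort: "*" comes before "+", and both come before letters.
ordered = sorted(tags)
for tag in ordered:
    print(tag)
```

The "*" tags come out together, then the "+" tags, then the unprefixed ones.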

Color code tags with the same or similar colors, to indicate similar concepts. Use long tag names to make selection less error-prone (no accidentally hitting the + or – buttons).

To change tag colors or names, open the Organize Tags dialog box by clicking on the link at the bottom of the tags pane, then click on the tag name or color to change.


Create a visualization from your tags

See the “Export this list as a spreadsheet” link at the top of the Organize Tags dialog box? That will produce a CSV which lists how many documents have each tag.

You can load this data into another program to visualize it. This is how Mick Conroy of TemperoUK created this analysis of the social media conversation around drones, by importing the tag data into this visualization software.
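Even without special software, a few lines of Python will turn tag counts into a quick bar chart. The data and the `tag` and `count` column names below are made up for the example, so check the header row of your own export.

```python
import csv
import io

# Assumption: invented data and column names standing in for the exported CSV.
exported = io.StringIO("tag,count\nreviewed,120\ninteresting,45\nfollow up,12\n")

def text_bars(rows, width=40):
    """Render one text bar per tag, scaled so the biggest count fills `width`."""
    counts = [(row["tag"], int(row["count"])) for row in rows]
    biggest = max(n for _, n in counts)
    return [f"{tag:<12} {'#' * round(width * n / biggest)} {n}" for tag, n in counts]

bars = text_bars(csv.DictReader(exported))
print("\n".join(bars))
```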

Tag all documents that do not contain tag “abc”:

  1. Tag all documents with a new tag “Not abc” (by selecting the top of the tree)
  2. Select the “abc” tag
  3. Remove the “Not abc” tag from the selected documents (click on the ‘-‘ on the “Not abc” tag)

Tag all documents that have tags “a” OR “b” OR “c”

  1. Select tag “a”
  2. Create a new tag “a or b or c”. All “a” documents should now have this tag.
  3. Select tag “b”
  4. Add the “a or b or c” tag (click on the ‘+’ on the “a or b or c” tag)
  5. Select tag “c”
  6. Add the “a or b or c” tag.

Tag all documents that have tags “a” AND “b” AND “c”

  1. Using the first procedure above, create tags for “not a”, “not b”, and “not c”.
  2. Using the second procedure above, create a “not a OR not b OR not c” tag.
  3. Using the first procedure above, create “Not (not a OR not b OR not c)”, calling it “a and b and c”.
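If the AND recipe looks like magic, it is De Morgan's law from set theory: "a and b and c" is the same as "not (not a or not b or not c)". Here is a sketch with Python sets standing in for tags. The document numbers and the `tag_not` and `tag_or` helpers are invented for the example.

```python
# Ten documents, numbered 1 to 10, and which of them carry each tag.
all_docs = set(range(1, 11))
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
c = {4, 6, 7}

def tag_not(tagged):
    """First procedure: tag everything, then remove the tag from the selection."""
    return all_docs - tagged

def tag_or(*tag_sets):
    """Second procedure: select each tag in turn and add the new tag."""
    result = set()
    for tagged in tag_sets:
        result |= tagged
    return result

# Third procedure: NOT (not a OR not b OR not c)
a_and_b_and_c = tag_not(tag_or(tag_not(a), tag_not(b), tag_not(c)))

print(a_and_b_and_c)               # {4}
print(a_and_b_and_c == a & b & c)  # True
```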