Regular-expression search

Power users, rejoice: Overview has a powerful new way to search.

In addition to all the rest of our search syntax, we proudly present regular-expression search.

Prior to today, Overview only let you search using an index: Overview’s tally of which words appear on which pages. That’s why it’s so fast: it doesn’t read all your documents every time you type a search phrase. But you can’t search the index for whitespace or capitalization. It only lets you search for words.

Enter regular-expression search. Now you can search for characters (Unicode characters, to be precise), including expressions with spaces, digits, punctuation, or unusual characters in them.

Here are some example searches:

  • /Smith/: Search for all documents that contain the five characters: S, m, i, t, h. This will match “The Smithsonian”, because it contains those five characters. It won’t match “wordsmiths”, because there’s no uppercase S. (Searches are set off by slashes, and they’re case-sensitive.)
  • /(?i)Smith/: Search for all documents that contain the five characters: s, m, i, t, h, either lowercase or uppercase. This will match “wordsmiths.” ((?i) enables case-insensitive mode.)
  • /caf[ée]/: Search for the sequence: c, a, f, and then either e or é. (Beware: Unicode has two ways of representing é, and this expression only searches for one.)
  • /(?m)^-- \nAdam/: Search for a line starting with two dashes, a space, a newline, and then the letters Adam — someone’s plain-text email signature. ((?m) enables scanning for line beginnings and endings; ^ scans for line start; --  scans punctuation and spaces; \n scans for a newline. Don’t bother searching for \r: Overview stores all newlines as \n. Also, Overview stores all ISO control characters except \n and \t as spaces.)
  • /path/to/file.txt isn’t a regular expression: it’s a normal text search. (Regular expressions must end with / and can’t contain an inner /.)
  • /path\/to\/file\.txt/ matches this filename with the slashes. (Use \ to escape / and \ inside your regular expressions. You don’t need to double up on backslashes for typical regular-expression features like \. (match a period) and \b (match a word break).)
  • title:/file\.txt/ only searches the document title. (You can search text, title, notes and any other field. By default, Overview scans document text and title.)
  • text:Smith AND text:/Smith/: Search the index for the word smith, then scan the matching documents for the correct capitalization.
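The slash rules in the list above can be sketched in a few lines of Go. This is only an illustration of the rules as described, not Overview's actual parser: parseRegexTerm is a hypothetical helper, and treating an empty // as a normal search is our assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRegexTerm reports whether term is written as /.../ with no unescaped
// inner slash. If so, it returns the pattern with each \/ turned back into /.
// Hypothetical helper: it mirrors the rules in the post, not Overview's code.
func parseRegexTerm(term string) (pattern string, ok bool) {
	if len(term) < 3 || term[0] != '/' || term[len(term)-1] != '/' {
		return "", false // assumption: an empty // is not a regex search
	}
	inner := term[1 : len(term)-1]
	var b strings.Builder
	for i := 0; i < len(inner); i++ {
		switch {
		case inner[i] == '\\' && i+1 < len(inner):
			if inner[i+1] != '/' {
				b.WriteByte('\\') // keep \. \b \\ and friends for the regex engine
			}
			b.WriteByte(inner[i+1])
			i++ // skip the escaped character
		case inner[i] == '\\':
			return "", false // a trailing backslash would escape the closing slash
		case inner[i] == '/':
			return "", false // unescaped inner slash: a normal text search
		default:
			b.WriteByte(inner[i])
		}
	}
	return b.String(), true
}

func main() {
	for _, term := range []string{
		`/Smith/`,
		`/path/to/file.txt`,
		`/path\/to\/file\.txt/`,
	} {
		pattern, ok := parseRegexTerm(term)
		fmt.Printf("%-24s regex=%-5v pattern=%q\n", term, ok, pattern)
	}
}
```

The second term is reported as a normal search because it does not end with a slash; the third keeps \. untouched and only unescapes the slashes.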

Alas, that last search hints at an annoying pitfall. Since regular-expression search doesn’t use an index, it can be slow. When it gets too slow, Overview will stop searching and return an incomplete list of results. (Today, the limit is 2,000 documents.) You’ll see a warning when this happens. The workaround: use a normal search to find all documents that might match the regular expression, then AND that search with your regular-expression search. Only the documents matching the normal search will be scanned for your regular expression.

Regular expressions can be tricky to write. Use a regex tester to make sure you’re searching correctly. Our syntax is identical to the Go language’s regex syntax.
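Since the syntax is Go's, a few lines of Go make a serviceable tester. The sketch below runs the example patterns from this post against sample strings we made up; the matches helper is ours, and the strings use \u escapes so the two spellings of é are visible.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern finds a match anywhere in text.
// It panics on an invalid pattern, which is fine for a quick tester.
func matches(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	tests := []struct{ pattern, text string }{
		{`Smith`, "The Smithsonian"},
		{`Smith`, "wordsmiths"},
		{`(?i)Smith`, "wordsmiths"},
		{`(?m)^-- \nAdam`, "Thanks!\n-- \nAdam"},
		{`file\.txt`, "file.txt"},
		{`file\.txt`, "file-txt"},
		// "caf\u00e9" uses the single-character é; "cafe\u0301" is e plus a combining accent.
		{"caf\u00e9", "caf\u00e9"},
		{"caf\u00e9", "cafe\u0301"},
	}
	for _, t := range tests {
		fmt.Printf("%-18q on %-22q -> %v\n", t.pattern, t.text, matches(t.pattern, t.text))
	}
}
```

The last pair is the Unicode pitfall mentioned earlier: a pattern written with the single-character é does not match text that spells it as e followed by a combining accent.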

Sort by Metadata

Now you can sort your document list by metadata fields.

By default, Overview has always sorted documents alphabetically by title:

Now you can sort in a different way. First, add a Field and set a value on each document:

Now, change “Sorted by title” to your new Field:

You can click the arrow to reverse the sort order:

This should help you stay organized.

Add Notes to documents

Have you spotted an important paragraph you want to remember?

Now Overview lets you annotate documents you uploaded via File Upload. Here’s how:

  1. Open the document.
  2. Click Add Note in the document-view toolbar.
  3. Click and drag over the interesting area.
  4. Release the mouse button and type your note.
  5. Click Save.

You can flip through notes on a document using the Next Note and Previous Note buttons, next to the Add Note button.

Oh, and you can also search your notes. Add notes: to your search, like this:

Enjoy!

Search in your Fields

If you’ve added metadata fields to your documents, rejoice: now you can search them!

I’ll set metadata on two documents for a quick demonstration:

I’ve set metadata on this document….
… And then I set metadata on this second document….

Now, I can search these new fields by prepending [Field Name]: to the search:

Search can now include custom fields

Here are the subtle rules. Consider them an addendum to our previous search syntax blog post:

  • Date:2015-11-08 will match all documents containing the phrase “2015-11-08” in the “Date” field (assuming you created a “Date” field).
  • date:2015-11-08 will not match any documents in this example. (Field names are case-sensitive.)
  • "Full Name":John Smith lets you specify a field name with spaces. (Field names with spaces or parentheses must be quoted.)
  • Author:Adam H* will search for any phrase starting with “Adam H” in the “Author” field. (All usual search syntax works on metadata fields.)
  • Be wary when your field names clash with Overview’s built-in names. text:Overview and title:Overview continue to search the actual documents’ texts and titles, not any “text” or “title” metadata fields you may have added. If you have a metadata field named “text”, you can force a metadata search by quoting the field name: for instance, "text":Overview.
  • We search all metadata the same way we search document text: we ignore punctuation and we don’t allow comparisons such as, “search for all documents with a Date within the past year”.
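The prefix rules above can be sketched in Go as follows. This is an illustration under our own assumptions, not Overview's parser: fieldQuery, parseFieldQuery, and matchesMetadata are hypothetical names, the built-in list is limited to text, title, and notes, and the matching is a plain case-insensitive substring test with no wildcard support.

```go
package main

import (
	"fmt"
	"strings"
)

// fieldQuery is one parsed search term (hypothetical type, for illustration).
type fieldQuery struct {
	Field   string // field name; empty when the term has no prefix
	Builtin bool   // true when the prefix is a built-in name (text, title, notes)
	Phrase  string // what to look for
}

// parseFieldQuery splits an optional field prefix from the search phrase.
func parseFieldQuery(q string) fieldQuery {
	if strings.HasPrefix(q, `"`) {
		// Quoted name: always a metadata field, even if it is called "text".
		if end := strings.Index(q[1:], `":`); end >= 0 {
			return fieldQuery{Field: q[1 : 1+end], Phrase: q[end+3:]}
		}
		return fieldQuery{Phrase: q}
	}
	name, phrase, found := strings.Cut(q, ":")
	if !found || name == "" || strings.ContainsAny(name, " ()") {
		return fieldQuery{Phrase: q} // names with spaces or parentheses must be quoted
	}
	builtin := name == "text" || name == "title" || name == "notes"
	return fieldQuery{Field: name, Builtin: builtin, Phrase: phrase}
}

// matchesMetadata checks a metadata query against one document's fields.
// Map keys are compared exactly, so field names are case-sensitive.
func matchesMetadata(fields map[string]string, fq fieldQuery) bool {
	if fq.Field == "" || fq.Builtin {
		return false // not a metadata query; handled elsewhere
	}
	value, ok := fields[fq.Field]
	return ok && strings.Contains(strings.ToLower(value), strings.ToLower(fq.Phrase))
}

func main() {
	doc := map[string]string{"Date": "2015-11-08", "Full Name": "John Smith"}
	for _, q := range []string{
		"Date:2015-11-08",
		"date:2015-11-08",
		`"Full Name":John Smith`,
		"text:Overview",
		`"text":Overview`,
	} {
		fq := parseFieldQuery(q)
		fmt.Printf("%-26s %+v metadata match: %v\n", q, fq, matchesMetadata(doc, fq))
	}
}
```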

Five things you didn’t know Overview could do

1. Find names of places and companies

The Entities plugin will automatically find company names, place names (in multiple languages!), numbers, or just unusual words that aren’t in the dictionary. Like all plugins, it’s available under Add View.

The entities plugin

Overview’s entity detection algorithms are designed to err on the side of including things that aren’t entities, rather than missing things which are — unlike normal NLP techniques which often miss 50% of entities. You can hit the little red X’s to remove junk from the list.

2. Make scanned PDFs searchable (OCR)

Overview will automatically OCR any PDF which doesn’t seem to have any text in it, such as scanned pages, using the open-source Tesseract engine. Scanned documents will be much slower to load — but you won’t be able to search them until you OCR them somewhere, so why not let Overview do it?

If you’d like to get OCR’d files out of Overview, you can simply export the documents after Overview has loaded them. You’ll get searchable PDFs back.

3. Customize the Word Cloud

You can use the delete tool to remove the words that aren’t adding anything. When you remove words, less common words are added to fill up the space. This way you can zero in on exactly what you want to investigate.

Edit word cloud

You can have more than one word cloud at a time, through the Add View menu. Press the Hidden Words button to unhide words.

4. The all-powerful Export

You can export all documents or just the result of the current search. For example, you could download only documents with the word “pizza” in them. And you can export either one document per file, or just the text (and any custom fields) as a CSV.

Export options

This means you can use Overview as a text extractor: upload random files, download a clean spreadsheet of the text. Or an OCR machine: upload random files, get searchable PDFs back.

5. Add custom data to each document

Overview now supports custom fields, or as we like to call them, document metadata.

You can add a field and set the value for all documents in a batch import.

Add field on import

Or you can edit the fields on one document at a time in the document viewer. If you add a field to one document, it will appear (initially blank) for all documents.

document field

Or, if you load your documents via CSV, Overview will read in each extra column as a field.

Each field will be its own column when you export as a spreadsheet.

Add fields during import

Now you can add custom fields to all documents at once while you’re importing. You can use this to make notes about the batch of documents, such as tracking the source of each document in your set.

I’ll walk you through it.

First, add files as usual.
Use the (new) “Fields” interface to specify fields for these documents.
Now, every document you uploaded has the field values you wrote.
You can specify other field values whenever you add more documents to the document set.
The original documents’ fields will have the original values. The new documents’ fields will have the new values.
Export the document set, and you’ll see the field values for all documents.

Fields — or document metadata — are a great feature, and hopefully this makes them a little more useful.

Overview now does OCR!

We’ve added a new feature into Overview: Optical Character Recognition (OCR). That means you can upload scanned PDFs and Overview will automatically read the text from them.

Overview decides when to use OCR automatically, on every page that has fewer than 100 characters of searchable text. This will make your uploads a lot slower, but you will need to OCR them anyway before you can search them, and you can’t beat the convenience.

Overview uses Tesseract for OCR, because Tesseract is free. Sometimes Tesseract produces more garbage characters than other OCR engines, such as the one included in Adobe Acrobat Pro. If you’ve already OCR’d your documents using another program, Overview will just read the previously created text.

 

How Overview handles pesky Microsoft Excel

Overview has long had a fairly important (if little-used) feature: it can export all documents into a spreadsheet. Day one, we wrote the spreadsheet in “comma-separated values” (“CSV”) format.

Then we realized that Microsoft Excel couldn’t open all CSV files.

Day two, we implemented a separate export file type, just for Microsoft Excel. Here are the differences.

Import, edit, and create document metadata

Have you ever needed to extract the author or write notes for each document? Now you can, with fields.

The new “Fields” section sits in wait underneath each document. Click it, and you’ll be able to create fields and change their values.

The list of fields is the same for every document in a document set. Each document has its own field values.

You can create new fields directly in Overview, or import them as extra columns in your CSV (see: importing documents using a CSV file.)

All your fields will appear as new columns when you export a spreadsheet:

You can use a spreadsheet program to filter for field values.
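If you would rather script the filtering, a short Go program can do the same job. The sketch below is a guess at the workflow, not a description of Overview's export format: the column names title, text, and Author and the sample rows are made up, and filterRows is our own helper.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// filterRows returns the rows of a CSV export whose column named field
// equals value. The first row is treated as the header.
func filterRows(csvText, field, value string) ([][]string, error) {
	rows, err := csv.NewReader(strings.NewReader(csvText)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	col := -1
	for i, name := range rows[0] {
		if name == field {
			col = i
			break
		}
	}
	if col < 0 {
		return nil, fmt.Errorf("no column named %q", field)
	}
	var out [][]string
	for _, row := range rows[1:] {
		if row[col] == value {
			out = append(out, row)
		}
	}
	return out, nil
}

// export is a made-up sample; a real export will have its own columns.
const export = `title,text,Author
memo.pdf,"Budget notes, draft 2",Adam
letter.pdf,Dear Sir,Jane
report.pdf,Q3 results,Adam
`

func main() {
	rows, err := filterRows(export, "Author", "Adam")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, row := range rows {
		fmt.Println(row[0])
	}
}
```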

The “fields” feature is new. We know there’s plenty of pizzazz to add:

  • Right now, we only support single-line text fields: no dates, numbers, geo-coordinates, or so forth. As a workaround, format your text values carefully (e.g., use YYYY-MM-DD for dates) so your spreadsheet program can grok them.
  • Overview’s search feature doesn’t examine field values.
  • You cannot create fields or write to fields using the API. (You can read field values with the API, though: they’re in document.metadata.)
  • You can only set metadata on one document at a time.
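For the date workaround in the first bullet, here is a small Go sketch that rewrites a few common date spellings as YYYY-MM-DD before you type them into a field. The isoDate helper and the list of accepted layouts are our own choices. A side benefit of this format is that plain text sorting puts the dates in chronological order.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// isoDate converts a date written in one of a few common layouts to YYYY-MM-DD.
// The accepted layouts are an arbitrary selection for this sketch.
func isoDate(s string) (string, error) {
	layouts := []string{"2006-01-02", "January 2, 2006", "Jan 2, 2006", "2 January 2006"}
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t.Format("2006-01-02"), nil
		}
	}
	return "", fmt.Errorf("unrecognized date %q", s)
}

func main() {
	var values []string
	for _, raw := range []string{"Nov 8, 2015", "July 13, 2015", "2014-02-01"} {
		iso, err := isoDate(raw)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		values = append(values, iso)
	}
	sort.Strings(values) // text order equals date order for YYYY-MM-DD
	fmt.Println(values)
}
```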

Hi! Now we’re www.overviewdocs.com

We’re changing https://www.overviewproject.org to redirect to our new domain name, https://www.overviewdocs.com.

Why the change? Two reasons:

  1. Overview isn’t just an experimental “project” any more. Overview is a go-to tool for doc-crunching.
  2. Overview isn’t a non-profit “.org”. Overview Services Inc. is a commercial company. Don’t get us wrong: we’d love a donation as much as anybody. But our consulting work keeps the lights on.

How does this affect you, our user? Well … uh … the text in your browser’s URL field will shrink by three characters. That’s about it.

Our automatic redirects will kick in today, Monday, July 13, 2015, around noon. Don’t worry: you won’t lose any of your work, even if you’re using Overview while we switch.

Update, July 13, 2015: all done.