Power users, rejoice: Overview has a powerful new way to search.
In addition to all the rest of our search syntax, we proudly present regular-expression search.
Prior to today, Overview only let you search using an index: Overview’s tally of which words appear on which pages. That’s why it’s so fast: it doesn’t read all your documents every time you type a search phrase. But you can’t search the index for whitespace or capitalization. It only lets you search for words.
Enter regular-expression search. Now you can search for characters (Unicode characters, to be precise), including expressions with spaces, digits, punctuation, or unusual characters in them.
Here are some example searches:
/Smith/: Search for all documents that contain the five characters S, m, i, t, h, in that order. This will match “The Smithsonian”, because it contains those five characters. It won’t match “wordsmiths”, because there’s no uppercase S. (Regular-expression searches are delimited by slashes, and they’re case-sensitive.)
/(?i)Smith/: Search for all documents that contain the five characters: s, m, i, t, h, either lowercase or uppercase. This will match “wordsmiths.” ((?i) enables case-insensitive mode.)
/caf[ée]/: Search for the sequence: c, a, f, and then either e or é. (Beware: Unicode has two ways of representing é, and this expression only searches for one.)
/(?m)^-- \nAdam/: Search for a line starting with two dashes, a space, a newline, and then the letters Adam — someone’s plain-text email signature. ((?m) enables scanning for line beginnings and endings; ^ scans for line start; -- scans punctuation and spaces; \n scans for a newline. Don’t bother searching for \r: Overview stores all newlines as \n. Also, Overview stores all ISO control characters except \n and \t as spaces.)
/path/to/file.txt isn’t a regular expression: it’s a normal text search. (Regular expressions must end with / and can’t contain an unescaped inner /.)
/path\/to\/file\.txt/ matches this filename with the slashes. (Use \ to escape / and \ inside your regular expressions. You don’t need to double up on backslashes for typical regular-expression features like \. (match a period) and \b (match a word break).)
title:/file\.txt/ only searches the document title. (You can search text, title, notes and any other field. By default, Overview scans document text and title.)
text:Smith AND text:/Smith/: Search the index for the word smith, then scan the matching documents for the correct capitalization.
Alas, that last search raises an annoying pitfall. Since regular-expression search doesn’t use an index, it can be slow. When it gets too slow, Overview will stop searching and return an incomplete list of results. (Today, the limit is 2,000 documents.) You’ll see a warning when this happens. The workaround: use a normal search to find all documents that might match the regular expression, then AND that search with your regular-expression search. Only the documents matching the normal search will be scanned for your regular expression.
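To make the workaround concrete, here is a minimal Python sketch of the two-stage idea: a cheap index lookup narrows the candidates, and only those candidates are scanned with the regular expression. Everything in it is an assumption for illustration: the `DOCUMENTS` dictionary, the function names, and the way the 2,000-document limit is applied are hypothetical, and Python’s `re` module stands in for whatever engine Overview actually uses.

```python
import re

# Hypothetical in-memory document set; Overview's real storage is not shown here.
DOCUMENTS = {
    1: "The Smithsonian is in Washington.",
    2: "Local wordsmiths met on Tuesday.",
    3: "Nothing relevant here.",
    4: "Mr. Smith signed the contract.",
}


def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index.setdefault(word, set()).add(doc_id)
    return index


def index_search(index, word):
    """Fast lookup: whole words only, capitalization ignored."""
    return index.get(word.lower(), set())


def regex_scan(documents, candidate_ids, pattern, limit=2000):
    """Scan candidates with a regex, reading at most `limit` documents.

    Returns (matching ids, incomplete), where `incomplete` is True when
    some candidates were left unscanned. Applying the limit as a simple
    document count is an assumption made for this sketch.
    """
    regex = re.compile(pattern)
    ordered = sorted(candidate_ids)
    matches = [doc_id for doc_id in ordered[:limit]
               if regex.search(documents[doc_id])]
    return matches, len(ordered) > limit


index = build_index(DOCUMENTS)

# Scanning everything with a tiny limit gives an incomplete result.
print(regex_scan(DOCUMENTS, DOCUMENTS.keys(), "Smith", limit=2))  # ([1], True)

# Narrowing with the index first keeps the scan under the limit.
candidates = index_search(index, "Smith")
print(regex_scan(DOCUMENTS, candidates, "Smith", limit=2))  # ([4], False)
```

Note that the index lookup for “Smith” skips “The Smithsonian”, because the index only knows whole words. Pick a normal search that is broad enough to include every document your regular expression could match.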
If you’ve added metadata fields to your documents, rejoice: now you can search them!
I’ll set metadata on two documents for a quick demonstration:
Now, I can search these new fields by prepending [Field Name]: to the search:
Here are the subtle rules. Consider them an addendum to our previous search syntax blog post:
Date:2015-11-08 will match all documents containing the phrase “2015-11-08” in the “Date” field (assuming you created a “Date” field).
date:2015-11-08 will not match any documents in this example. (Field names are case-sensitive.)
"Full Name":John Smith lets you specify a field name with spaces. (Field names with spaces or parentheses must be quoted.)
Author:Adam H* will search for any phrase starting with “Adam H” in the “Author” field. (All usual search syntax works on metadata fields.)
Be wary when your field names clash with Overview’s built-in names. text:Overview and title:Overview continue to search the actual documents’ texts and titles, not any “text” or “title” metadata fields you may have added. If you have a metadata field named “text”, you can force a metadata search by quoting the field name: for instance, "text":Overview.
We search all metadata the same way we search document text: we ignore punctuation and we don’t allow comparisons such as, “search for all documents with a Date within the past year”.
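The field-name rules above can be sketched as a small parser. This is only an illustration of the rules as described in this post, not Overview’s actual query parser: the function names, the returned tuples, and the substring-based matching are all assumptions.

```python
import re

# Built-in field names mentioned in this post; they win over metadata
# fields of the same name unless the name is quoted.
BUILT_IN_FIELDS = {"text", "title", "notes"}

# Either "Quoted Name":phrase or BareName:phrase. Bare names may not
# contain spaces, quotes, or parentheses.
FIELD_QUERY = re.compile(
    r'^(?:"(?P<quoted>[^"]+)"|(?P<bare>[^\s:"()]+)):(?P<phrase>.*)$'
)


def parse_field_query(query):
    """Return (kind, field, phrase), where kind is 'default', 'builtin' or 'metadata'."""
    match = FIELD_QUERY.match(query)
    if match is None:
        return ("default", None, query)
    if match.group("quoted") is not None:
        # A quoted name always means a metadata field, even "text".
        return ("metadata", match.group("quoted"), match.group("phrase"))
    name = match.group("bare")
    kind = "builtin" if name in BUILT_IN_FIELDS else "metadata"
    return (kind, name, match.group("phrase"))


def metadata_contains(metadata, field, phrase):
    """Case-sensitive field lookup; simplified case-insensitive substring match."""
    value = metadata.get(field)
    return value is not None and phrase.lower() in value.lower()


print(parse_field_query('Date:2015-11-08'))         # ('metadata', 'Date', '2015-11-08')
print(parse_field_query('"Full Name":John Smith'))  # ('metadata', 'Full Name', 'John Smith')
print(parse_field_query('text:Overview'))           # ('builtin', 'text', 'Overview')
print(parse_field_query('"text":Overview'))         # ('metadata', 'text', 'Overview')
```

The real search also handles prefix searches such as Author:Adam H* and ignores punctuation. The substring check here does neither, so treat it as a stand-in.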
Now you can add custom fields to all documents at once while you’re importing. You can use this to make notes about a batch of documents, such as tracking the source of each document in your set.
I’ll walk you through it.
Fields — or document metadata — are a great feature, and hopefully this makes them a little more useful.
We’ve added a new feature into Overview: Optical Character Recognition (OCR). That means you can upload scanned PDFs and Overview will automatically read the text from them.
Overview decides automatically when to use OCR: it runs on every page that has fewer than 100 characters of searchable text. This will make your uploads a lot slower, but you need to OCR scanned pages anyway before you can search them, and you can’t beat the convenience.
Overview uses Tesseract for OCR, because Tesseract is free. Sometimes Tesseract produces more garbage characters than other OCR engines, such as the one included in Adobe Acrobat Pro. If you’ve already OCR’d your documents using another program, Overview will just read the previously created text.
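The per-page decision described above boils down to a length check. Here is a minimal sketch of that rule. The function names are hypothetical, and whether Overview counts surrounding whitespace toward the 100 characters is not stated in this post, so stripping it is an assumption.

```python
OCR_THRESHOLD = 100  # characters of searchable text, per this post


def needs_ocr(page_text):
    """True when a page has too little existing text to trust.

    Stripping surrounding whitespace before counting is an assumption.
    """
    return len(page_text.strip()) < OCR_THRESHOLD


def pages_to_ocr(pages):
    """Return the indexes of the pages that would be sent to OCR."""
    return [i for i, text in enumerate(pages) if needs_ocr(text)]


# A scanned page (no text), a nearly empty page, and two pages that
# already carry text, for example from a previous OCR pass.
pages = ["", "x" * 99, "x" * 100, "y" * 500]
print(pages_to_ocr(pages))  # [0, 1]
```

Pages that already carry enough text are left alone, which is why documents OCR’d in another program keep their existing text.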
All your fields will appear as new columns when you export a spreadsheet:
The “fields” feature is new. We know there’s plenty of pizzazz to add:
Right now, we only support single-line text fields: no dates, numbers, geo-coordinates, or so forth. As a workaround, format your text values carefully (e.g., use YYYY-MM-DD for dates) so your spreadsheet program can grok them.
Overview’s search feature doesn’t examine field values.
You cannot create fields or write to fields using the API. (You can read field values with the API, though: they’re in document.metadata.)
You can only set metadata on one document at a time.
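Since fields are plain text, it helps to normalize values before typing them in. As one example of the date workaround, here is a small Python sketch that rewrites dates as YYYY-MM-DD. The helper name and the sample input formats are assumptions. You would run something like this on your own data before entering it into Overview.

```python
from datetime import datetime


def to_iso_date(value, input_format):
    """Rewrite a date string as YYYY-MM-DD, given its current strptime format."""
    return datetime.strptime(value, input_format).strftime("%Y-%m-%d")


print(to_iso_date("11/08/2015", "%m/%d/%Y"))  # 2015-11-08
print(to_iso_date("08.11.2015", "%d.%m.%Y"))  # 2015-11-08

# A side benefit: YYYY-MM-DD strings sort chronologically as plain text.
dates = ["2015-11-08", "2014-01-30", "2015-02-01"]
print(sorted(dates))  # ['2014-01-30', '2015-02-01', '2015-11-08']
```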
Overview supports phrase searches, fuzzy searches, and booleans. Here’s what you can search for in the search box and the Multi-Search plugin:
John Smith: All documents containing the phrase “John Smith”. All the words, in order.
Pizza~: All documents matching the word “Pizza” or similar words such as “Piazza” or “Pizzas”. (“~” after a single word means fuzzy search. It can find documents that contain typos.)
John Smith AND Alice Smith: All documents containing both the phrase “John Smith” and the phrase “Alice Smith”. (“AND” means both phrases must appear.)
John Smith OR Alice Smith: All documents containing either the phrase “John Smith” or the phrase “Alice Smith”, or both. (“OR” means at least one of the phrases must appear.)
John Smith AND NOT Alice Smith: All documents containing the phrase “John Smith” and not the phrase “Alice Smith”. (“NOT” means the phrase must not appear.)
Alice AND NOT (Bob OR Carol): All documents containing the phrase “Alice” and neither the phrase “Bob” nor the phrase “Carol”. (Parentheses help organize complicated queries.)
"John and Alice Smith": All documents containing the phrase “John and Alice Smith”. (Without quotation marks, it would have been interpreted as “(John) AND (Alice Smith)”. Quotation marks tell Overview to ignore operators such as AND, OR and NOT. You can use quotation marks or apostrophes.)
John Smith~2: All documents matching the phrase “John Smith” or phrases with the words John and Smith at most 2 words apart, such as “John 'The Culprit' Smith”. (“~N” after a multi-word phrase means proximity search.)
Smith*: All documents containing a word that begins with “Smith”, such as “Smith”, “Smithy” or “Smithsonian”. (“*” after a phrase means prefix search.)
title:John Smith: All documents containing the phrase “John Smith” in their titles. The other way around is text:John Smith. By default, Overview searches both the title and the text.
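For readers who want a feel for what fuzzy, proximity, and prefix searches do, here is a rough Python sketch of the three ideas. It is not how Overview implements them. The function names, the edit-distance threshold of 1, and the in-order proximity rule are all assumptions chosen to reproduce the examples above.

```python
import re


def words(text):
    """Split text into lowercased words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance: single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_match(text, word, max_edits=1):
    """Pizza~ : any word within `max_edits` edits of the search word."""
    return any(edit_distance(w, word.lower()) <= max_edits for w in words(text))


def prefix_match(text, prefix):
    """Smith* : any word starting with the prefix."""
    return any(w.startswith(prefix.lower()) for w in words(text))


def proximity_match(text, first, second, max_gap):
    """John Smith~2 : `second` follows `first` with at most `max_gap` words between."""
    tokens = words(text)
    first_positions = [i for i, w in enumerate(tokens) if w == first.lower()]
    second_positions = [i for i, w in enumerate(tokens) if w == second.lower()]
    return any(0 < s - f <= max_gap + 1
               for f in first_positions for s in second_positions)


print(fuzzy_match("We met at the Piazza.", "Pizza"))                       # True
print(prefix_match("A trip to the Smithsonian", "Smith"))                  # True
print(proximity_match("John 'The Culprit' Smith", "John", "Smith", 2))     # True
print(proximity_match("John went to see Alice Smith", "John", "Smith", 2)) # False
```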
In the coming months, we’ll be sitting with users to see how this new query language works for them. If you have any feedback about a particular query, please use the “Talk to Us” link at the top of Overview.