Advanced search: quoted phrases, boolean operators, fuzzy matching, and more

Overview now supports advanced syntax in the search field.

This gives you an enormous amount of control over finding specific documents.

  • To find all documents that do not contain “elephant”, search for “-elephant”.
  • Use AND, OR and NOT to find all documents with a specific combination of words or phrases.
  • Use quotes to search for multiple word phrases like “disaster recovery” or “nacho cheese”.
  • Use ~ after any word to do a fuzzy search, like “Smith~”. This will match all words with up to two characters added, deleted, or changed. Great for searching through material with OCR errors.
  • By default, multiple words are now ANDed together if you don’t specify anything else.
  • Use “title:foo” to search for all documents with the word “foo” in the title. (The title is the upload filename, or the contents of the title column if you imported from CSV.)
  • You can use wildcards like * and ? or even full regular expressions using the syntax text:/regex/ or title:/regex/. Note that regular expressions cannot have spaces in them, because you are actually searching through the index of words used in the documents, not the original text.
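To get a feel for what the fuzzy operator in the list above is doing, here is a minimal sketch of matching words within two single-character edits, using plain Levenshtein distance. The function names and the word list are made up for illustration, and the real search engine may count edits a little differently (for example, when two adjacent letters are swapped).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character from b
                            prev[j - 1] + cost))  # substitute (or keep a match)
        prev = curr
    return prev[-1]


def fuzzy_match(term: str, words: list[str], max_edits: int = 2) -> list[str]:
    """Return the words within max_edits of term, ignoring case."""
    term = term.lower()
    return [w for w in words if edit_distance(term, w.lower()) <= max_edits]


# Hypothetical index words, including a typical OCR slip ("5mith").
words = ["Smith", "Smyth", "Smithe", "5mith", "Schmidt", "Smithsonian"]
matches = fuzzy_match("Smith", words)
print(matches)
```

This is why the fuzzy operator is handy for OCR'd material: a misread character or two still lands within the allowed distance, while genuinely different words do not.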

There are actually many more things you can do with this powerful search tool. See the ElasticSearch advanced query syntax for details.

Getting your documents into Overview — the complete guide

The first and most common question from Overview users is “how do I get my documents in?” The answer varies depending on the format of your material. There are three basic paths to get documents into Overview: as multiple files, from a single CSV file, and via DocumentCloud. But there are several other tricks you might need, depending on your situation.

1. The simple case – a folder full of documents
Overview can directly read many file types, such as PDFs, Word, PowerPoint, RTF, text files, and so on. Choose “Upload document files” and add each document using the file selection dialog box.

If you are using the Chrome browser, you can also select entire folders at once. If you are not using Chrome, you can add all files in a folder in one step by clicking on the first file, then scrolling down and shift+clicking on the last file. Overview will skip any files that are already uploaded.

Overview will automatically OCR files that are scanned images, though this takes much longer.

You may want to split long documents into individual pages. See below for details on both of these cases.

2. Journalist collaboration – DocumentCloud project import
DocumentCloud is a collaborative document hosting service operated exclusively for professional journalists. It supports upload, OCR, search, annotation, and publishing of documents, which may be public or private. Overview can import a DocumentCloud project directly, and DocumentCloud’s annotation tools are available directly within the Overview document viewer. This method is a good choice if you have access to DocumentCloud and the document set isn’t too large, up to a few thousand documents.

3. Documents as data – a CSV file
Overview can read an entire document set as a single CSV file, with one row per document containing the document text and optional information such as URL, tags, and unique ID. This might sound more complex than it is; the simple format is documented here. A CSV file is a good choice if you need to bring data in from another application, such as a social media monitoring tool. You can also use a CSV file to import tags, which can be used to compare text to data.
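If your documents live in another application, a few lines of script are usually enough to produce such a file. Here is a minimal sketch that writes one row per document. The column names (id, text, url, tags) and the comma-separated tag format are assumptions for illustration, so check them against the format documentation before importing.

```python
import csv
import io

# Made-up example documents; the keys are assumed column names.
documents = [
    {"id": "1", "text": "Budget memo, first draft.",
     "url": "http://example.com/docs/1", "tags": "memo,budget"},
    {"id": "2", "text": 'He said "no comment" and left.',
     "url": "http://example.com/docs/2", "tags": ""},
]


def write_document_csv(docs, fileobj):
    """Write one CSV row per document, with a header row first."""
    writer = csv.DictWriter(fileobj, fieldnames=["id", "text", "url", "tags"])
    writer.writeheader()
    writer.writerows(docs)


# Written to an in-memory buffer here; for a real file, use
# open("documents.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_document_csv(documents, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Using the csv module rather than joining strings by hand matters because document text routinely contains commas and quotation marks, and the module quotes those fields correctly.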

4. Everything is in one huge file — split the pages
Our users often receive their documents as a small number of huge files, thousands of pages each. This is especially common with FOIA requests, or when the source material is a stack of paper. In this case you will want Overview to analyze your material at the page level, not the file level. Overview can split pages automatically when importing files.

What is xkcd all about? Text mining a web comic

I recently ran into a very cute visualization of the topics of xkcd comics. It’s made using a topic modeling algorithm where the computer automatically figures out what topics xkcd covers, and the relationships between them. I decided to compare this xkcd topic visualization to Overview, which does a similar sort of thing in a different way (here’s how Overview’s clustering works).

Stand back, I’m going to try science!

Fortunately the source text file was already in exactly the right format for import. It took less than a minute to load and cluster these 1,299 docs.

The first cluster I found was all the “hat guy” comics. Overview’s phrase detection created a first-level folder for “hat guy” and also threw “beret” in there. Nice, but there’s a lot of non-hat related stuff in that folder too. This other material splits out into its own node two levels down, and seems to be comics about “guys” or “boys” and “girls.” That’s a pretty wide topic as opposed to hat guy comics (it includes a guy-girl duo Christmas special). I removed the guy-girl folder from the hat tag, and the result is shown in green below.

It’s fun to see exactly what each folder contains, because aside from the imported text descriptions of each comic there is conveniently a URL column in the source CSV, which becomes a clickable “source” link in the UI.

Another large first level folder (143 docs) contains comics about “graphs” or “axes”, “chart,” “lines,” etc. This one is a pretty clean folder, in that almost everything in it is one of xkcd’s charts, visualizations, maps, etc. or there is some sort of labelled schematic that appears in one of the panels, like this one. Overview was even able to separate out different types of charts, such as this folder which is mostly bar charts.

Then I started looking through smaller, lower level folders. I quickly found a newscast folder. What’s interesting about this folder is that there is no one word in common between all the newscast comics. But these comics have enough overlap through terms like “news”, “anchor”, and “press” that they get grouped together anyway. I went through each of the 15 docs (open the first doc in the folder, keep pressing next using either the arrow or the “j” key, untag when appropriate) to get an idea of how coherent or not this cluster is. 10 of the 15 are newscasts, as you can see from the orange tag highlight on the node in this image.

The screenshot also shows the programming folder to the right of the newscast folder (11 docs). Again, there is no one term that appears across all these docs. If there were, Overview would label the node with “ALL: programmer” or something. Instead we get some “programmer” but also “code” and “algorithm” and “mobile.” Again Overview has succeeded in finding a concept even though there is disparate language.

Topic quality varies throughout the tree, with some tight, interpretable topics and also some large “miscellaneous” folders. Of course you can always type a word into the search field to see exactly where documents containing a particular word ended up in the tree. I put about 30 minutes into this and I’ve tagged about 400 of the 1300 documents. (I could finish the job by using the new show untagged feature.) So we might get a pretty complete picture of what’s available in the xkcd universe in about 2 hours total. Of course, if you need high precision on the tags on individual documents, you have to check them manually (select the tag, then press “j” repeatedly to scan the docs quickly). Assigning tags to folders in Overview tends to over-tag somewhat because there is often some miscellaneous stuff in a folder.

How does Overview compare to other topic modeling algorithms?

Many folks have heard of topic modeling algorithms, which are different from but related to Overview’s text analysis. Topic modeling works by automatically assigning one of a predefined number of “topics” to each word in each document, whilst simultaneously figuring out which words should belong to which topic. There are many different topic modeling algorithms, but many are based on a technique called Latent Dirichlet Allocation (LDA). You can get a feel for what LDA does by doing it yourself with pen and paper.

My exploration of xkcd was inspired by a recent LDA analysis of the web comic by Carson Sievert. Here’s what that looks like, as a visualization of the extracted topics and their words:

Overview doesn’t derive “topics” directly. Instead it uses multi-level document clustering algorithms based on a standard technique called tf-idf cosine similarity. We do this because it’s simpler to implement, much faster to run on large document sets, and (we suspect) easier to interpret because each document gets placed in exactly one folder, whereas LDA assigns multiple topics per document. Arguably what Overview does is “topic modeling” since it tries to create topic-themed folders, but that name usually refers to LDA-type algorithms and I’ve been wondering for some time how Overview’s clustering compares.
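For the curious, here is a minimal sketch of the tf-idf cosine similarity measure itself, in plain Python. It is not Overview’s actual code: the whitespace tokenizer, the particular idf formula, and the three toy documents are assumptions for illustration, and the clustering step that builds folders on top of these similarities is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a sparse {word: weight} vector.
    Weight = term count * log(N / number of docs containing the word)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({w: c * math.log(n / doc_freq[w])
                        for w, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "news anchor reads the press release",
    "the anchor interrupts the news broadcast",
    "programmer debugs the sorting algorithm",
]
vecs = tfidf_vectors(docs)
sim_news = cosine(vecs[0], vecs[1])   # two newscast-like texts
sim_mixed = cosine(vecs[0], vecs[2])  # newscast vs. programming
print(sim_news, sim_mixed)
```

Note that “the” appears in every document, so its weight is zero and it contributes nothing to similarity. That is the point of the idf part: generic words stop mattering, and documents are pulled together by the rarer words they share.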

The “topics” of an LDA analysis are really just distributions of words, where some words are very common in that topic (perhaps “fish” if the topic is “the ocean”) and others are more rare. LDA topics correspond roughly to Overview’s folders, so let’s see how they compare. I was able to find a few points of reference in the LDA visualization above. Topic 1 seems to be all charts. Topic 17 has “hat” and “guy”, though I don’t see “beret” in there. There are many uninterpretable “miscellaneous” topics and a lot of seemingly random words in the tail of each topic. However, these words might make more sense if we could see the source comics easily from the interactive. LDA has many tuning parameters and algorithmic variants, and it’s possible that it might work especially well for other document sets; it seems to do a nice job on the Sarah Palin emails.

We’ve run into the problem of diversity: Randall Munroe writes about a huge range of different things, as defined by words and phrases that only appear in one or two comics. Also, many of the comics are hard to model since they have little text or feature only relatively generic words like “guy” and “woman.” This is actually a very common situation for document sets (or other high-dimensional data) and LDA and Overview deal with this heterogeneity in different ways. LDA seems to start “modeling the noise” by adding unrelated words to the words-in-a-topic distributions, while Overview ends up generating really miscellaneous folders that don’t resolve into a clear conceptual whole until several levels down the tree, or sometimes not at all.

Ultimately I don’t think the choice of text analysis algorithm is all that important, as long as you have one that works reasonably well. Topic modeling and document clustering are mathematically related anyway. The real trick in document mining is building a system that people can actually understand, trust, and use, as a recent paper from Stanford’s visualization lab makes wonderfully clear. Flexible document import, clear visualizations, rapid tagging, integrated search, easy document viewing: text mining is about much more than algorithms. Still, we are always exploring new types of analysis and visualization for Overview, so it’s fun to see how different techniques compare.

New: Show all untagged documents

Overview’s tags help you keep track of where you’ve been. Now there’s an easy way to see where you haven’t been: the Show Untagged button.

Show Untagged appears in the tag window at the bottom of the screen. When you press it, you’ll get a visual display of how many documents in each folder have no tag at all applied, as above (from our xkcd analysis). This is very useful if you need to exhaustively explore your documents, just to be certain you haven’t missed anything. Of course, exhaustive analysis doesn’t mean you must actually read every document. Instead, open each folder down to whatever level of detail you need in order to decide whether the material inside is relevant or not. When you’ve made that choice, tag and move on to the next folder.
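Conceptually, the count behind that display is simple. Here is a minimal sketch of it, assuming a flat list of documents that each record a folder name and a set of tags. The data layout, the folder names, and the function name are all hypothetical; Overview’s real folders are nested in a tree, which this ignores.

```python
from collections import Counter

# Hypothetical documents: each has a folder and a (possibly empty) set of tags.
documents = [
    {"id": 1, "folder": "graphs", "tags": {"charts"}},
    {"id": 2, "folder": "graphs", "tags": set()},
    {"id": 3, "folder": "newscast", "tags": {"news"}},
    {"id": 4, "folder": "hat guy", "tags": set()},
    {"id": 5, "folder": "hat guy", "tags": set()},
]


def untagged_per_folder(docs):
    """Count, for each folder, the documents that have no tag at all."""
    counts = Counter()
    for doc in docs:
        if not doc["tags"]:
            counts[doc["folder"]] += 1
    return counts


remaining = untagged_per_folder(documents)
for folder, count in remaining.most_common():
    print(f"{folder}: {count} untagged")
```

Folders that drop out of the count entirely are the ones you have finished with, which is what makes the feature useful for exhaustive review.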

Thanks to our users who asked for this feature. If there’s something that would really speed up your work, contact us.