Documents not sorted the way you'd like? Try ignoring words

Overview categorizes documents by looking for the words that best separate one group of documents from all the others. If many documents contain the word “cat” but most do not, Overview will create a “cat” folder. This works well until it doesn’t: maybe you don’t care about cats. That’s why you can tell Overview to ignore any specific word when sorting your documents.

To do this, type the words you want ignored into the import options.
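As a rough illustration of what ignoring a word amounts to, here is a minimal Python sketch. It is not Overview's actual code: the function names `parse_ignored_words` and `remove_ignored` are hypothetical, and the assumption that the typed list is separated by spaces or commas is made purely for the example.

```python
import re

def parse_ignored_words(typed):
    """Split a typed list (assumed space- or comma-separated) into a lowercase set."""
    return {w.lower() for w in re.split(r"[,\s]+", typed) if w}

def remove_ignored(text, ignored):
    """Break text into lowercase words and drop any word in the ignore set."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in ignored]

ignored = parse_ignored_words("cat, Cats")
tokens = remove_ignored("The cat sat near two cats and a dog", ignored)
print(tokens)
```

Note that matching here is exact: ignoring “cat” does not ignore “cats” unless both are listed. Whether Overview behaves the same way is not something this sketch can tell you.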

There are many cases where Overview might sort documents based on words you don’t care about:

  • emails might be sorted based on who sent them, when you’re more interested in what’s in them
  • forms might be sorted based on the questions, rather than the answers
  • documents might be sorted based on administrative blather (versions, copyright, disclaimers…)
  • if you got the documents by searching for A, B, and C, they might end up just sorted into A, B, and C folders
  • maybe you just don’t care about X, but Overview made a folder for it anyway

Overview is prone to making these mistakes because it examines every single word (and every two-word phrase) when deciding how to file a document. An alternative approach is entity extraction, which looks only for recognizable people, places, organizations, dates, etc. But entity extraction is often unreliable, and it misses all sorts of other meaningful words, like verbs. You probably want to know when many documents have the words “paid” or “killed” in them.
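To make the "every word and two-word phrase" idea concrete, the sketch below collects single words and adjacent word pairs from each document and counts how many documents contain each one. This is an illustrative assumption about the general technique, not Overview's implementation; the names `terms` and `document_frequency` and the sample documents are invented for the example.

```python
import re
from collections import Counter

def terms(text, ignored=frozenset()):
    """Return the set of words and two-word phrases in text, skipping ignored words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in ignored]
    # Simplification: pairs are formed after ignored words are removed,
    # so a pair can span the gap left by an ignored word.
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(pairs)

def document_frequency(docs, ignored=frozenset()):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(terms(doc, ignored))
    return counts

docs = [
    "Copyright notice. The contractor was paid twice.",
    "Copyright notice. The witness was killed.",
    "Copyright notice. The invoice was paid late.",
]
df = document_frequency(docs)
df_ignored = document_frequency(docs, ignored={"copyright", "notice"})
```

In this toy data, “copyright” appears in every document, so it says nothing about how the documents differ, while “paid” and “killed” do. Passing the administrative words as `ignored` removes them from consideration entirely and leaves the verbs in place, which an entity extractor would not have picked up.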

More fundamentally, Overview doesn’t understand your story. It really has no idea what a “meaningful” organization of the documents might be. All it can do is find patterns of language usage across documents. Not only does the computer lack basic human understanding, even a very smart computer can’t get inside your head: what is interesting depends on what you, the analyst, think is interesting.

It might seem like the answer is to tell Overview what you care about, but one of the central ideas of Overview is that you shouldn’t have to know what’s in the documents before you look at them — otherwise it’s not possible to discover the unexpected. Instead, if you think that Overview is over-emphasizing the obvious or the trivial, you can tell the computer what isn’t important.