View the same documents in different ways with multiple trees

Starting today, Overview supports multiple trees for each document set. That is, you can tell Overview to re-import your documents, or a subset of them, with different options, without uploading them again. You can use this to:

  • Focus on a subset of your documents, such as those with a particular tag or containing a specific word.
  • Use ignored and important words to try sorting your documents in different ways.

You create a new tree using the “New Tree” link above the tree.

This brings up a dialog box that looks very similar to the usual import options. You can name the tree (good for reminding yourself why you made it) and set ignored and important words to tell Overview how you want your documents organized in this tree. You can also include only those documents with a specific tag.

To create a tree that contains only documents matching a particular search term, first turn your search into a tag using the “create tag from search results” button next to the search box. Then choose that tag when you create the new tree.

Tags are shared between all of the trees created from a document set. That means when you tag a document in one tree, it will be tagged in every other tree. You can try viewing your documents with different trees, tagging in whatever tree is easiest to use.

After you create a tree, you can get information about what you created by clicking the little (i) on the tab for that tree.


Who will bring AI to those who cannot pay?

One Sunday night in 2009, a man was stabbed to death in the Brentwood area of Long Island. Due to a recent policy change, there was no detective on duty that night, and his body lay uncovered on the sidewalk until morning. Newsday journalist Adam Playford wanted to know if the Suffolk County legislature had ever addressed this event. He read through 7,000 pages of meeting transcripts and eventually found the legislators talking about it:

the incident in, I believe, the Brentwood area…

This line could not have been found through text search. It does not contain the words “police” or “body,” or the victim’s name or the date, and “Brentwood” matches too many other documents. Playford read the transcripts manually, which took weeks, because there was no other way available to him.
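To make the keyword problem concrete, here is a minimal sketch that runs the obvious search terms against the quoted line. The `matches` helper and the choice of whole-word, case-insensitive matching are our own assumptions for illustration; real search engines differ in the details, but the outcome for this line would be the same.

```python
import re

# The line quoted above, as it appears in the transcripts.
line = "the incident in, I believe, the Brentwood area…"


def matches(text: str, term: str) -> bool:
    """Case-insensitive whole-word match (a stand-in for a simple text search)."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


# Terms a reporter might plausibly try first.
hits = {term: matches(line, term) for term in ["police", "body", "Brentwood"]}
print(hits)
```

Only “Brentwood” matches, and on its own that term also matches every other transcript that mentions the area.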

But there is another way, potentially much faster and cheaper. It’s possible for a computer to know that “the incident in Brentwood” refers to the stabbing, if it’s programmed with enough contextual information and sophisticated natural language reasoning algorithms. The necessary artificial intelligence (AI) technology now exists. IBM’s Watson system used these sorts of techniques to win at Jeopardy, playing against world champions in 2011.

Last month, IBM announced the creation of a new division dedicated to commercializing the technology they developed for Watson. They plan to sell to “healthcare, financial services, retail, travel and telecommunications.”

Journalism is not on this list. That’s understandable, because there is (comparatively speaking) no money in journalism. Yet there are journalists all over the world now confronted with enormous volumes of complex documents from leaks, open government programs, and freedom of information requests. And journalism is not alone. The Human Rights Data Analysis Group is painstakingly coding millions of handwritten documents from the archives of the former Guatemalan national police. UN Global Pulse applies big data for humanitarian purposes, such as understanding the effects of sudden food price increases. The crisis mapping community is developing automated social media triage and verification systems, while international development workers are trying to understand patterns of funding by automatically classifying aid projects.

Who will serve these communities? There’s very little money in these applications; none of these projects can pay anywhere near what a hedge fund, a law firm, or an intelligence agency can. And it’s not just about money: these humanitarian fields have their own complex requirements, and a tool built for finding terrorists may not work well for finding stories. Our own work with journalists shows that there are significant domain-specific problems when applying natural language processing to reporting.

The good news is that many people are working on sophisticated software tools for journalism, development, and humanitarian needs. The bad news is that the problem of access can’t be solved by any single piece of software. Technology is advancing constantly, as are the scale and complexity of the data problems that society faces. We need to figure out how to keep transferring advanced techniques to the non-profit world, such as the natural language processing employed by Watson, which is well documented in public research papers.

We need organizations dedicated to continuous transfer of AI technology to these underserved sectors. I’m not saying that for-profit companies cannot do this; there may yet be a market solution, and in any case “non-profit” organizations can charge for services (as the Overview Project does for our consulting work). But it is clear that the standard commercial model of technology development, exemplified by IBM’s billion-dollar investment in Watson, will largely ignore the unprofitable social uses of such technology.

We need a plan for sustainable technology transfer to journalism, development, academia, human rights, and other socially important fields, even when they don’t seem like good business opportunities.