One Sunday night in 2009, a man was stabbed to death in the Brentwood area of Long Island. Due to a recent policy change there was no detective on duty that night, and his body lay uncovered on the sidewalk until morning. Newsday journalist Adam Playford wanted to know if the Suffolk County legislature had ever addressed this event. He read through 7,000 pages of meeting transcripts and eventually found legislators talking about it:
the incident in, I believe, the Brentwood area…
This line could not have been found through text search. It does not contain the word “police” or “body,” or the victim’s name or the date, and “Brentwood” matches too many other documents. Playford read the transcripts manually — it took weeks — because there was no other way available to him.
But there is another way, potentially much faster and cheaper. It’s possible for a computer to know that “the incident in Brentwood” refers to the stabbing, if it’s programmed with enough contextual information and sophisticated natural language reasoning algorithms. The necessary artificial intelligence (AI) technology now exists. IBM’s Watson system used these sorts of techniques to win at Jeopardy, playing against world champions in 2011.
Last month, IBM announced the creation of a new division dedicated to commercializing the technology they developed for Watson. They plan to sell to “healthcare, financial services, retail, travel and telecommunications.”
Journalism is not on this list. That’s understandable, because there is (comparatively speaking) no money in journalism. Yet there are journalists all over the world now confronted with enormous volumes of complex documents, from leaks, open government programs, and freedom of information requests. And journalism is not alone. The Human Rights Data Analysis Group is painstakingly coding millions of handwritten documents from the archives of the former Guatemalan national police. UN Global Pulse applies big data for humanitarian purposes, such as understanding the effects of sudden food price increases. The crisis mapping community is developing automated social media triage and verification systems, while international development workers are trying to understand patterns of funding by automatically classifying aid projects.
Who will serve these communities? There’s very little money in these applications; none of these projects can pay anywhere near what a hedge fund or a law firm or intelligence agency can. And it’s not just about money: these humanitarian fields have their own complex requirements, and a tool built for finding terrorists may not work well for finding stories. Our own work with journalists shows that there are significant domain-specific problems when applying natural language processing to reporting.
The good news is that many people are working on sophisticated software tools for journalism, development, and humanitarian needs. The bad news is that the problem of access can’t be solved by any single piece of software. Technology is advancing constantly, as are the scale and complexity of the data problems that society faces. We need to figure out how to continue to transfer advanced techniques (like the natural language processing employed by Watson, which is well documented in public research papers) to the non-profit world.
We need organizations dedicated to continuous transfer of AI technology to these underserved sectors. I’m not saying that for-profit companies cannot do this; there may yet be a market solution, and in any case “non-profit” organizations can charge for services (as the Overview Project does for our consulting work). But it is clear that the standard commercial model of technology development, such as IBM’s billion-dollar investment in Watson, will largely ignore the unprofitable social uses of such technology.
We need a plan for sustainable technology transfer to journalism, development, academia, human rights, and other socially important fields, even when they don’t seem like good business opportunities.