A visualization sketching system

Over the last year, my colleagues and I at The Associated Press have been exploring visualizations of very large collections of documents. We’re trying to solve a pressing problem: We have far more text than hours to read it. Sometimes a single Freedom of Information request will produce a thousand pages, to say nothing of the increasingly common WikiLeaks-sized dumps of hundreds of thousands of documents, or huge databases of public documents.

Because reading every word is impossible, a large data set is only as good as the tools we use to access it. Search can help us find what we’re looking for, but only if we know what we are looking for. Instead, we’ve been trying to make “maps” of large data sets, visualizations of the topics or locations or the interconnections between people, dates, and places. We’ve had a few notable successes, such as our visualization of the Iraq war logs.

But frankly, this has been a slow process, because the tools for large-scale text analysis are terrible. Existing programs break when faced with more than a few thousand documents. More powerful software exists, but only in component form. It requires lots of programming to get a useful result.

Meanwhile, DIY visualization thrives. At the Eyeo festival in Minneapolis this summer, I was overwhelmed by the vibrant community that has formed around data visualization. Several hundred people sat in a room and listened raptly to talks by data artist Jer Thorp, social justice visualizer Laura Kurgan, the measurement-obsessed Nick Felton, and many others. Suddenly, a great many people are enthusiastically making images from code and data.

The weapon of choice for this community is Processing, a language designed specifically for interactive graphics by Ben Fry and Casey Reas (both of whom were at Eyeo). Creative communities thrive on good tools; think of Instagram, Instructables, or Wikipedia.

We want Overview to be the creative tool for people who want to explore text visualization — “investigative journalists and other curious people,” as our grant application put it.

The algorithms that our prototypes use are old by tech standards, dating mostly from information retrieval research in the ’80s. But then, the algorithms that the resurgent visualization community is implementing in Processing are mostly old, too; I coded many of them in C++ in the early 1990s when I was learning computer graphics programming. Today, one doesn’t have to learn C++ to make pictures with algorithms. The Processing programming environment takes care of all the hard and boring parts and provides a simple, lightweight syntax. It’s a visualization “sketching” system, tailor-made for the rapid expression of visual ideas in code.

No such programming environment exists if you want to do visualizations of the text content of large document sets. First, you have to extract some sort of meaning from the language. Natural language processing has a long history and is advancing rapidly, but the available toolkits still require a huge amount of specialist knowledge and programming skill.

Big data also requires many computers running in parallel, and while there are now wonderful components such as distributed NoSQL stores and the Hadoop map-reduce framework, it’s a lot of work to assemble all the pieces. The current state of the art simply doesn’t lend itself to experimentation. I’d love for people with modest technical ability to be able to play around with document set visualizations, but we don’t have the right tools.

This is the hole that we’d like Overview to fill. There are certain key problems, such as email visualization, that we know Overview has to solve. But we’d like to solve them by building a sort of text visualization programming system. The idea is to provide basic text processing operations as building blocks, letting the user assemble them into algorithms. It should be easy to recreate classic techniques, or invent new ones by trial and error. The distributed storage and data flow should be handled automatically behind the scenes, as much as possible.
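To make the building-block idea concrete, here is a minimal sketch in Python of what composable text-processing operations might look like. All of the names here (`tokenize`, `remove_stopwords`, `term_counts`, `pipeline`) are illustrative assumptions, not Overview’s actual API:

```python
# Hypothetical sketch of the "building blocks" idea: small text-processing
# operations that a user could chain into an analysis pipeline.
import re
from collections import Counter

def tokenize(doc):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", doc.lower())

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "an", "of", "and"})):
    """Drop very common words that carry little meaning."""
    return [t for t in tokens if t not in stopwords]

def term_counts(tokens):
    """Count how often each term occurs in one document."""
    return Counter(tokens)

def pipeline(docs, *steps):
    """Apply each processing step to every document in turn."""
    results = docs
    for step in steps:
        results = [step(r) for r in results]
    return results

docs = ["The patrol reported small arms fire.",
        "A convoy reported an IED strike."]
counts = pipeline(docs, tokenize, remove_stopwords, term_counts)
```

The point of an arrangement like this is that swapping in a different tokenizer or adding a stemming step is a one-line change, which is exactly the kind of trial-and-error experimentation we want to enable.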

That’s an ambitious project, and we are going to have to scale it down. Perhaps the first version of Overview won’t be as expressive or efficient as we’d like; we are explicitly prioritizing useful solutions to real problems over elegant tools that can’t be used for actual analysis. By the end of our Knight Foundation grant, Overview has to solve at least one difficult and essential problem in data journalism.

But ultimately, what we intend to build is a sketching system for visualizing the content and meaning of large collections of text documents — big text, as opposed to big data. Just as the Processing language has been a great enabler of the DIY visualization community, we hope that Overview will give interested folks a simple way to play with lots of different text processing techniques — and that we’ll all learn some interesting things from mining our ever-increasing store of public documents.

This post was originally published at PBS IdeaLab.

Overview is hiring!

We need two Java or Scala developers to build the core analytics and visualization components of Overview, and lead the open-source development community. You’ll work in the newsroom at AP’s global headquarters in New York, which will give you plenty of exposure to the very real problems of large document sets.

The exact responsibilities will depend on who we hire, but we imagine that one of these positions will be more focused on user experience and process design, while the other will do the computer science heavy lifting — though both must be strong, productive software engineers. Core algorithms must run on a distributed cluster, and scale to millions of documents. Visualization will be through high-performance OpenGL. And it all has to be simple and obvious for a reporter on deadline who has no time to fight technology. You will be expected to implement complex algorithms from academic references, and expand prototype techniques into a production application.

You will work closely with investigative reporters on real stories, ensuring that the developing application serves their real world document-dump reporting needs. You will also work with visualization experts and other specialists from across industry and academia, and act as the technical lead for the open-source development and user communities.

We can offer competitive salaries for this two-year contract. Please send your resume to jstray@ap.org.

Requirements:

  • demonstrated ability to design and ship a large application with a clean, minimal, functional user interface
  • BSc in CS or EE, or equivalent familiarity with computer science theory
  • mathematical ability, especially statistical models and linear algebra
  • 5 years’ experience as a Java software developer
  • familiarity with distributed open source development projects
  • experience in computer graphics and distributed systems a plus

Investigating thousands (or millions) of documents by visualizing clusters

This is a recording of my talk at the NICAR (National Institute for Computer-Assisted Reporting) conference in February 2011, where I discussed some of our recent work at the AP with the Iraq and Afghanistan war logs.

References cited in the talk:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different preconceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs).
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
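For the curious, the basic techniques listed above — bag-of-words counting, tf-idf weighting, and cosine similarity — can be sketched in a few lines of plain Python. This is a toy illustration, not production code; a real system would use a vectorized implementation:

```python
# Toy illustration of bag-of-words, tf-idf, and cosine similarity.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn tokenized documents into tf-idf weighted term vectors."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # bag-of-words term frequencies
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Pre-tokenized example documents (tokens are illustrative).
docs = [["ied", "explosion", "route"],
        ["ied", "explosion", "convoy"],
        ["detainee", "transfer", "custody"]]
vecs = tf_idf_vectors(docs)
```

Note how the idf weighting downweights terms that appear in many documents, so a pair of reports sharing a rare term looks more similar than a pair sharing a ubiquitous one.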

What can we accomplish in two years?

Overview is an ambitious project. The prototype workflow is based on automatically clustering documents by analyzing patterns of word usage, and our results so far are very promising. But that doesn’t immediately mean that this is the direction that development should take. There is a whole universe of document and data set problems facing journalists today, and a wide range of computational linguistics, visualization, and statistical methods we could try to apply. The space of possibility is huge.

But we can already say a few things about what must be accomplished for the project to be considered a success. We’re thinking not only about what must be accomplished by the end of the two-year grant, but how we’d like the project to evolve long after that. Within the space of our two year grant, we think we need to do the following things:

  1. Build an active community of developers and users
  2. Release production software that works well for common problems
  3. Develop a core, scalable architecture that vastly increases the pace of R&D

We need your help on each of these goals, in different ways.

A full-text visualization of the Iraq war logs

This is a description of some of the proof-of-concept work that led to the Overview prototype, originally posted elsewhere.

Last month, my colleague Julian Burgess and I took a shot at peering into the Iraq War Logs by visualizing them in bulk, as opposed to using keyword searches in an attempt to figure out which of the 391,832 SIGACT reports we should be reading. Other people have created visualizations of this unique document set, such as plots of the incident locations on a map of Iraq, and graphs of monthly casualties. We wanted to go a step further, by designing a visualization based on the richest part of each report: the free-text summary, where a real human describes what happened, in jargon-inflected English.

Also, we wanted to investigate more general visualization techniques. At The Associated Press we get huge document dumps on a weekly or sometimes daily basis. It’s not unusual to get 10,000 pages from a FOIA request — emails, court records, meeting minutes, and many other types of documents, most of which don’t have latitude and longitude that can be plotted on a map. And all of us are increasingly flooded by large document sets released under government transparency initiatives. Such huge files are far too large to read, so they’re only as useful as our tools to access them. But how do you visualize a random bunch of documents?

We’ve found at least one technique that yields interesting results: a graph visualization where each document is a node, and the edges between them are weighted using cosine similarity on TF-IDF vectors. I’ll explain exactly what that is and how to interpret it in a moment. But first, the journalism. We learned some things about the Iraq war. That’s one sense in which our experiment was a success; the other valuable lesson is that there is a boatload of research-grade visual analytics techniques just waiting to be applied to journalism.

[Image: graph visualization of the December 2006 SIGACT reports (click for super hi-res version)]

Interpreting the Iraq War, December 2006
This is a picture of the 11,616 SIGACT (“significant action”) reports from December 2006, the bloodiest month of the war. Each report is a dot. Each dot is labelled by the three most “characteristic” words in that report. Documents that are “similar” have edges drawn between them. The location of the dot is abstract, and has nothing to do with geography. Instead, dots with edges between them are pulled closer together. This produces a series of clusters, which are labelled by the words that are most “characteristic” of the reports in that cluster. I’ll explain precisely what “similar” and “characteristic” mean later, but that’s the intuition.
