Investigating thousands (or millions) of documents by visualizing clusters

This is a recording of my talk at the NICAR (National Institute for Computer-Assisted Reporting) conference in February 2011, where I discussed some of our recent work at the AP on the Iraq and Afghanistan war logs.

References cited in the talk:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs).
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
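The basic text-processing pipeline referenced above — bag of words, tf-idf weighting, and cosine similarity — can be sketched in a few lines of plain Python. This is an illustrative toy, not the code used for the war-logs maps; the example documents are made up.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn tokenized documents into tf-idf weighted bag-of-words vectors."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency scaled by inverse document frequency;
        # terms appearing in every document get weight zero
        vec = {term: (count / len(doc)) * math.log(n / df[term])
               for term, count in tf.items()}
        vectors.append(vec)
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# hypothetical mini-corpus
docs = [
    "ied attack on convoy".split(),
    "ied attack near checkpoint".split(),
    "detainee transfer report".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # overlapping reports: > 0
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms: 0.0
```

Pairwise cosine similarities like these are what feed the clustering and layout steps; at millions of documents the all-pairs computation is exactly the part that needs the high-performance approaches mentioned above.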

3 thoughts on “Investigating thousands (or millions) of documents by visualizing clusters”

  1. Very cool.

    Have you thought about integrating the work done on WordNet at Princeton into this system? How important is the clustering of the exact syntax or lexical form of words versus their semantic meaning?

    Also, what about allowing documents to belong to more than one cluster? Could not a particular document be relevant to more than one cluster at a time?

  2. We’ve thought about WordNet but it may be of limited use for our purposes, because the sorts of document sets we deal with very often have specialized vocabularies. It will be interesting to see how this unfolds as we develop the project further.

    All of the clustering algorithms we use are “soft,” so documents can belong to more than one cluster. Actually, we aren’t doing clustering at all in the traditional sense of categorization, but rather running spatial layout algorithms.
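To illustrate the spatial-layout idea in the reply above: multi-dimensional scaling places each document as a point in 2-D so that inter-point distances approximate inter-document distances, and "clusters" emerge visually rather than as hard category assignments. Below is a crude stress-minimizing sketch in pure Python — a toy stand-in for a large-scale algorithm like Glimmer, with a hypothetical hand-written distance matrix.

```python
import math
import random

def mds_layout(dist, iters=500, lr=0.05, seed=1):
    """Toy stress-minimizing MDS: iteratively nudge 2-D points so their
    Euclidean distances approach the target pairwise distances."""
    rng = random.Random(seed)
    n = len(dist)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9
                # positive error => too far apart => move i toward j
                err = (d - dist[i][j]) / d
                pos[i][0] -= lr * err * dx
                pos[i][1] -= lr * err * dy
    return pos

# hypothetical distances: two tight pairs of documents, far from each other
dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
pos = mds_layout(dist)  # the two pairs land as two visually separate clumps
```

In the resulting layout a document sits between groups when it is similar to both, which is the soft, frame-dependent behavior discussed in the reply — no algorithm ever assigns it to a single category.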
