Overview is hiring!

We need two Java or Scala developers to build the core analytics and visualization components of Overview, and lead the open-source development community. You’ll work in the newsroom at AP’s global headquarters in New York, which will give you plenty of exposure to the very real problems of large document sets.

The exact responsibilities will depend on who we hire, but we imagine that one of these positions will be more focused on user experience and process design, while the other will do the computer science heavy lifting — though both must be strong, productive software engineers. Core algorithms must run on a distributed cluster, and scale to millions of documents. Visualization will be through high-performance OpenGL. And it all has to be simple and obvious for a reporter on deadline who has no time to fight technology. You will be expected to implement complex algorithms from academic references, and expand prototype techniques into a production application.

You will work closely with investigative reporters on real stories, ensuring that the developing application serves their real world document-dump reporting needs. You will also work with visualization experts and other specialists from across industry and academia, and act as the technical lead for the open-source development and user communities.

We can offer competitive salaries for this two-year contract. Please send your resume to jstray@ap.org.

Requirements:

  • demonstrated ability to design and ship a large application with a clean, minimal, functional user interface
  • BSc. in CS, EE, or equivalent familiarity with computer science theory
  • mathematical ability, especially statistical models and linear algebra
  • 5 years of experience as a Java software developer
  • familiarity with distributed open source development projects
  • experience in computer graphics and distributed systems a plus

 

 

Investigating thousands (or millions) of documents by visualizing clusters

This is a recording of my talk at the NICAR (National Institute for Computer-Assisted Reporting) conference in February 2011, where I discussed some of our recent work at the AP with the Iraq and Afghanistan war logs.

References cited in the talk:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs.)
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
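
For readers who want to see how the basic text processing techniques in the list above fit together, here is a minimal Python sketch, not the code used for the maps in the talk. It assumes a crude whitespace tokenizer and one common tf-idf weighting (term frequency times log of inverse document frequency). The function names and the sample sentences in the usage note are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    # Bag of words: keep only term counts, discarding word order.
    bags = [Counter(tokenize(d)) for d in docs]
    n_docs = len(bags)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            # Terms that are frequent here but rare elsewhere get high weight.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling tfidf_vectors on a list such as ["ied explosion near checkpoint", "ied found near checkpoint cleared", "detainee transferred to police station"] and comparing pairs with cosine_similarity gives a higher score to the first two (which share vocabulary) than to the first and third (which share none). A matrix of such pairwise similarities is the kind of input that clustering and layout tools work from.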

What can we accomplish in two years?

Overview is an ambitious project. The prototype workflow is based on automatically clustering documents by analyzing patterns of word usage, and our results so far are very promising. But that doesn’t immediately mean that this is the direction that development should take. There is a whole universe of document and data set problems facing journalists today, and a wide range of computational linguistics, visualization, and statistical methods we could try to apply. The space of possibility is huge.

But we can already say a few things about what the project must accomplish to be considered a success. We’re thinking not only about where it must be by the end of the two-year grant, but also about how we’d like it to evolve long after that. Within the two-year grant period, we think we need to do the following things:

  1. Build an active community of developers and users
  2. Release production software that works well for common problems
  3. Develop a core, scalable architecture that vastly increases the pace of R&D

We need your help on each of these goals, in different ways.