What can we accomplish in two years?

Overview is an ambitious project. The prototype workflow is based on automatically clustering documents by analyzing patterns of word usage, and our results so far are very promising. But that doesn’t immediately mean this is the direction development should take. There is a whole universe of document and data set problems facing journalists today, and a wide range of computational linguistics, visualization, and statistical methods we could try to apply. The space of possibility is huge.
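To make “patterns of word usage” concrete, here’s a minimal sketch of the idea behind that kind of clustering: represent each document as a TF-IDF-weighted term vector, so documents that use words in similar proportions end up with similar vectors. This is a from-scratch illustration of the general technique, not the prototype’s actual code.

```python
# Minimal illustration: documents become sparse term-weight vectors, and
# cosine similarity measures how alike two word-usage patterns are.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn tokenized documents into TF-IDF weighted term vectors."""
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: count * math.log(n / df[t]) for t, count in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    """1.0 means identical word-usage patterns; 0.0 means no shared terms."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
           math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

docs = [text.lower().split() for text in [
    "budget meeting minutes for the city budget",
    "city budget shortfall discussed at the meeting",
    "football scores from friday night",
]]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # high: both documents discuss the budget
print(cosine(vecs[0], vecs[2]))  # 0.0: no overlapping vocabulary
```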

But we can already say a few things about what must be accomplished for the project to be considered a success. We’re thinking not only about what must be accomplished by the end of the two-year grant, but also about how we’d like the project to evolve long after that. Within the space of our two-year grant, we think we need to do the following things:

  1. Build an active community of developers and users
  2. Release production software that works well for common problems
  3. Develop a core, scalable architecture that vastly increases the pace of R&D

We need your help on each of these goals, in different ways.

Build an active community of developers and users

The Overview team at The Associated Press will be developing the core architecture. But we need to get other people involved for a variety of reasons.

To begin with, although it’s easy to see the overall trend of large document and data sets in journalism, it’s harder to state clear goals for software designed to address this trend. What are the most painful things that journalists must do with document sets? What are the things they’d like to do but can’t today? What problems exist, and which should we focus on solving? The only way to answer these questions is to talk to users. Today, as we ramp up to full-scale development, Overview doesn’t have any users outside a small group of AP staff who have experimented with our prototype. We need to find others who have similar problems and get them talking to us, and to each other.

We also need developers outside the AP. We need them for all the reasons that open source is often such a spectacularly productive way to build software. Not only do external developers help us get the work done; they also force us to pay attention to the readability of the code and the extensibility of the architecture.

Release production software that works well for common problems

We now have funding to hire two top-notch developers for two years. (Thank you, News Challenge!) Just like a start-up, we have to deliver results before the money runs out. There’s no way we can solve all the world’s document-dump reporting problems in 24 months, but we can solve a few. So far, we’ve been thinking about these general scenarios:

“The database dump” is epitomized by the Wikileaks releases. All of the documents are of the same type, and the goal is to understand what the main topics and themes are. Our prototype is based around this use case.
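As a hypothetical sketch of this use case, the following groups a handful of same-type documents into topics using TF-IDF vectors and k-means. The scikit-learn library and the choice of k-means are assumptions for illustration; the prototype’s actual clustering method isn’t shown here.

```python
# Hypothetical sketch: cluster a homogeneous dump into topics.
# Assumes scikit-learn, which is not necessarily the prototype's stack.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "cable describes embassy security concerns",
    "embassy cable on visa security procedures",
    "war logs report roadside bomb casualties",
    "report of casualties after roadside bomb",
]

# Vectorize by word usage, then group documents with similar vectors.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for label, doc in sorted(zip(labels, docs)):
    print(label, doc)  # documents sharing a label share a topic
```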

“The email dump” is increasingly common. Topic clustering is still useful, but emails have other structure we can visualize. For example, we might want to explore the social network of authors or the threads of conversation.
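For instance, the From, To, and Cc headers alone define a social network. Here’s a minimal sketch that counts who mails whom, assuming a folder named emails/ full of raw .eml files; it uses only Python’s standard library, and real dumps have far messier headers than this handles.

```python
# Hypothetical sketch: build a sender -> recipient graph from email headers.
import email
from collections import Counter
from email.utils import getaddresses
from pathlib import Path

edges = Counter()
for path in Path("emails").glob("*.eml"):  # assumed folder of raw messages
    msg = email.message_from_bytes(path.read_bytes())
    senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
    recipients = [addr for _, addr in
                  getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))]
    for s in senders:
        for r in recipients:
            edges[(s, r)] += 1

# The heaviest sender -> recipient pairs hint at the dump's social structure.
for (sender, recipient), count in edges.most_common(10):
    print(f"{count:4d}  {sender} -> {recipient}")
```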

“The scrambled dump” often results from open records requests. Here there are lots of different types of documents all mixed up — emails and meeting minutes and budget spreadsheets and who knows what else. The first order of business is to automatically sort the documents into their original categories.
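One plausible first pass, sketched below with scikit-learn and invented stand-in data: hand-label a small sample of documents, train a simple bag-of-words classifier on it, and let the classifier guess a category for everything else. This is an assumed approach for illustration, not a description of what Overview will ship.

```python
# Hypothetical sketch: sort a mixed dump into categories from a labeled sample.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A few hand-labeled examples per category (stand-in text).
sample_texts = [
    "From: mayor@city.gov To: clerk@city.gov Subject: tomorrow's agenda",
    "Minutes of the regular council meeting. Motion to adjourn carried.",
    "Dept,Q1,Q2,Q3,Q4\nParks,120000,130000,125000,140000",
]
sample_labels = ["email", "meeting minutes", "budget spreadsheet"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(sample_texts), sample_labels)

# Every unlabeled document gets a best-guess category to review by hand.
mystery = ["From: cfo@city.gov To: mayor@city.gov Subject: shortfall"]
print(model.predict(vectorizer.transform(mystery)))  # -> ['email']
```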

The source material for any of these cases might arrive on paper, or it might be digital in some annoying format like a 7,000-page PDF (true story). It might be searchable text, or it might need OCR. In many cases, data import and cleanup is more work than the analysis itself. So we can add a fourth tough problem: fast, easy, and flexible data import. This might be the hardest of all.
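To give a flavor of the import problem, here’s a hypothetical sketch that pulls embedded text out of a PDF page by page and falls back to OCR for scanned pages. It assumes the pypdf, pdf2image, and pytesseract packages (plus a Tesseract install), and it skips the error handling a production importer would need.

```python
# Hypothetical import sketch: embedded text where available, OCR otherwise.
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

def extract_pages(path: str):
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if not text:  # scanned page: rasterize just this page and OCR it
            image = convert_from_path(path, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        yield text

pages = list(extract_pages("document_dump.pdf"))  # assumed input file
print(f"imported {len(pages)} pages")
```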

What problems are we missing? What would be most valuable for version 1?

Develop a core, scalable architecture that vastly increases the pace of R&D

Computational linguistics and visualization are rich fields that have now gone through multiple generations of technology. Almost none of the well-known techniques in these fields have been applied to journalism. The Overview project aims to borrow a few of these tools. But what we really aspire to do is start the migration of technology from modern computer science into journalism. We want to build a platform that is nimble enough for experiments, and sophisticated enough for production applications. If we only solved the problems identified above, we might not end up with a system flexible enough to meet the next great reporting challenge.

This goal is about defining a large enough scope to future-proof the technology, while still managing to ship on time. We have chosen a target document set size of 10 million documents to force us to deal with all of the problems of big data, including distributed storage and computing. We wanted to have no choice but to go into the cloud. But this is also about building a design language to express text-based visual analytics algorithms. For example, we’d like to build our core functionality out of a set of computational linguistics building blocks. Users should be able to write and use new pieces. That’s the only way we can build a system that’s flexible and extensible enough to power many years’ worth of exploration in this space.
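To illustrate, one possible shape for those building blocks is small composable pipeline stages, as in this toy sketch; every name in it is invented for illustration, and nothing here is Overview’s actual API.

```python
# Toy sketch: text-processing building blocks as composable pipeline stages.
from typing import Callable, Iterable

# A stage takes a stream of documents and yields a transformed stream.
Stage = Callable[[Iterable], Iterable]

def pipeline(*stages: Stage) -> Stage:
    """Compose stages left-to-right into a single reusable stage."""
    def run(docs: Iterable) -> Iterable:
        for stage in stages:
            docs = stage(docs)
        return docs
    return run

# Two trivial stages; the hope is that users would contribute their own.
def lowercase(docs):
    return (d.lower() for d in docs)

def drop_short(docs, min_words=5):
    return (d for d in docs if len(d.split()) >= min_words)

clean = pipeline(lowercase, drop_short)
print(list(clean(["One Two Three Four Five Six", "too short"])))
```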