How big is a document dump?

When journalists end up with a huge stack of documents they need to sort through, how big is that stack? One of the fun things about working on Overview is we get firsthand experience with many of these stories, and get to talk a lot of nerdy shop about the others.

So here’s our casual list of document sets that journalists have had to contend with. I’ve thrown in the links where possible, and a description of how the documents were delivered.

  • U.K. MP expenses – 700,000 documents in 5,500 PDF files, from government website
  • Wikileaks Iraq war data – 391,832 structured records, each including a text description
  • Wikileaks diplomatic cables – 251,287 cables, each a few pages long
  • Military discharge records – 112,000 assorted files in just about every document file format
  • NSA files leaked by Snowden – 50,000 to 200,000 according to the NSA
  • Wikileaks Afghanistan war data – 91,731 structured records, same format as Iraq data
  • Free the Files – 43,200 political TV ad spending files, PDF scans of paper, from FCC website
  • Paul Ryan correspondence – 9000 pages, on paper, via FOIA request of more than 200 agencies
  • Tulsa PD emails – 8000 emails, in Outlook format, via FOIA request
  • Pentagon Papers – 7000 pages leaked in 1971, on paper, now declassified and available
  • Illegal adoption market in U.S. – 5029 messages web-scraped from a Yahoo! forum
  • Iraq security contractor reports – 4,500 pages on paper, via FOIA request
  • North Carolina foodstamp woes – 4,500 pages of emails, on paper, via FOIA request
  • New York State proposed legislation – 1,680 bills, downloaded via government API
  • White House Gulf of Mexico drilling emails – 628 documents, mostly emails, on paper, via FOIA request
  • Dollars for Docs – 65 gigantic disclosure reports, mostly huge PDFs of tables

So how big is the typical document dump? Well, it’s, ah… how do you measure that? How does a “record” compare to a “page” or an “email”? The first thing I see when I look at this list is the huge variety of file formats, largely because we’ve had to spend so much time helping people with a huge variety of file formats (more on that).

And it’s not just digital file formats. There’s actual paper involved in this business. Paper is a very popular choice for answers to FOIA requests. This is partially a technology problem and partially just that paper is still really good at certain things, like making absolutely 100% sure your redactions cannot be undone and the document metadata has been stripped. And even when you do get an electronic format, you might end up with a single massive PDF with thousands of variable-length emails (in which case do this to load it into Overview.)

But we do have some numbers here, and maybe a page and an email might be about the same-ish amount of work to deal with, so let’s imagine it’s all comparable and call it all pages. Some sets are very large, up to 700,000, but most are in the 5000-10000 page range. I’ll take a median instead of an average since the distribution is highly skewed, and… 9000.
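For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It treats every record, page, email, or file in the list as one "page", as described above. Two things in it are my assumptions rather than part of the list: the NSA set is entered as 125,000, the midpoint of the quoted range (any value in that range gives the same median), and the second calculation drops Dollars for Docs, since 65 giant reports are not really a page count. With all 16 sets the median falls halfway between the Tulsa and Paul Ryan entries; without Dollars for Docs it lands exactly on 9000.

```python
from statistics import mean, median

# Sizes copied from the list, one unit per record/page/email/file.
SIZES = {
    "U.K. MP expenses": 700_000,
    "Wikileaks Iraq war data": 391_832,
    "Wikileaks diplomatic cables": 251_287,
    "Military discharge records": 112_000,
    # Assumption: midpoint of the quoted 50,000 to 200,000 range.
    "NSA files leaked by Snowden": 125_000,
    "Wikileaks Afghanistan war data": 91_731,
    "Free the Files": 43_200,
    "Paul Ryan correspondence": 9_000,
    "Tulsa PD emails": 8_000,
    "Pentagon Papers": 7_000,
    "Illegal adoption market in U.S.": 5_029,
    "Iraq security contractor reports": 4_500,
    "North Carolina foodstamp woes": 4_500,
    "New York State proposed legislation": 1_680,
    "White House Gulf of Mexico drilling emails": 628,
    "Dollars for Docs": 65,  # 65 reports, not pages
}

all_sizes = list(SIZES.values())

# Assumption: leave out the one entry that is not measured in page-like units.
without_reports = [n for name, n in SIZES.items() if name != "Dollars for Docs"]

print("median, all 16 sets:", median(all_sizes))
print("median, without Dollars for Docs:", median(without_reports))
# The mean is dragged far upward by the few very large sets.
print("mean, all 16 sets:", round(mean(all_sizes)))
```

The mean comes out more than ten times larger than the median, which is why the median is the better summary for a list this skewed.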

The most typical size of document dump that journalists have to deal with is 9000 pages. At least, most typical of our collection. Half our cases are larger than that and half are smaller. The largest document sets that journalists work with are in the million range now, and we should expect that to increase. (See also: how to configure Overview to handle more documents.)

9000 pages would take 150 hours to read through completely at a rate of one per minute, or about 19 eight-hour workdays. The largest document set on this list (the MP expenses) would take almost exactly four years to read if you worked eight hours a day, every single day. This is why we need computers.
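The reading-time estimate is simple enough to sketch in a few lines of Python. The function name and constants below are made up for illustration; the assumptions are the ones in the text (one page per minute) plus an eight-hour reading day and a 365-day year with no days off.

```python
PAGES_PER_MINUTE = 1     # reading rate used in the text
HOURS_PER_WORKDAY = 8    # assumption: a full eight-hour day of reading
DAYS_PER_YEAR = 365      # assumption: no weekends or holidays off


def reading_time(pages):
    """Return (hours, workdays, years) needed to read `pages` pages."""
    hours = pages / PAGES_PER_MINUTE / 60
    workdays = hours / HOURS_PER_WORKDAY
    years = workdays / DAYS_PER_YEAR
    return hours, workdays, years


for label, pages in [("median set", 9_000), ("MP expenses", 700_000)]:
    hours, workdays, years = reading_time(pages)
    print(f"{label}: {hours:,.0f} hours, {workdays:,.2f} workdays, {years:.2f} years")
```

Changing the constants shows how sensitive the estimate is: at two pages per minute, or with weekends off, the four-year figure moves by a lot, but it never gets anywhere near practical.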