Your big pile of documents might arrive in many different forms, from a stack of paper to an archive of random files. But PDF is a popular document format that every reporter, researcher or analyst has to work with sooner or later. Now you can upload PDFs directly into Overview.
Just choose “Upload PDF files” from the “Import Documents” menu, then “Add files” to open a file selection box. You’ll want to upload more than one file, of course, so you can select all files in a directory by pressing Control-A (Windows) or ⌘-A (Mac). Or you can select multiple specific files using the keyboard and mouse in the usual way (if you’re not familiar with how to do that, here are instructions for Windows and Mac).
You can press “Add Files” as many times as you like. Overview will begin uploading files as soon as they are added, and then proceed to clustering and visualization when you press “Done Adding Files” and set the import options (such as language and words to ignore).
Overview treats each file as a single “document,” so if you have one long PDF with many documents within it, you will want to split it first. There are several free tools to do this, both web-based and command line. We will soon add the ability to split documents into pages automatically, as is already possible when importing from DocumentCloud. Also, Overview does not (yet!) do OCR, which is the process of turning a scanned image into searchable text. If you can search your PDFs or cut and paste text from them, they do not need OCR and Overview will be able to handle them. Otherwise, you can use commercial products such as ABBYY FineReader and OmniPage to do the OCR on your own computer before uploading.
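If you would rather script the split than use one of those tools, here is a minimal sketch in Python. It is not part of Overview. It assumes you already know the page on which each document starts, and it relies on pypdf, a third-party library (installed with `pip install pypdf`) that we picked purely for illustration. The function names `page_ranges` and `split_pdf` are made up for this example.

```python
from pathlib import Path


def page_ranges(first_pages, total_pages):
    """Turn 0-based first-page indexes into (start, end) ranges, end exclusive.

    Pages before the earliest first page are not included in any range.
    """
    starts = sorted(set(first_pages))
    if not starts or starts[0] < 0 or starts[-1] >= total_pages:
        raise ValueError("first pages must fall inside the PDF")
    ends = starts[1:] + [total_pages]
    return list(zip(starts, ends))


def split_pdf(source, first_pages, out_dir):
    """Write one PDF per page range and return the paths written."""
    # Assumption: pypdf is installed. It is imported here so that
    # page_ranges() can be used and tested without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    ranges = page_ranges(first_pages, len(reader.pages))
    for number, (start, end) in enumerate(ranges, 1):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        target = out_dir / f"{Path(source).stem}-{number:03d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

For a 10-page file whose documents begin on the first, fourth and eighth pages, `split_pdf("big.pdf", [0, 3, 7], "split")` would write three smaller PDFs into a `split` folder, ready to upload. The file and folder names here are placeholders.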
Other ways to get your documents into Overview
You can still import documents in several other ways:
- You can import a DocumentCloud project.
- You can upload a CSV file. CSV is a general data transfer format that most applications can write, including SQL databases, Microsoft Excel, and social media data sources like DataSift or Radian6.
- We’ve also written a powerful script which will scan a folder for documents in many different formats, automatically OCR if needed, and produce a CSV file for Overview. Check out docs2csv if this fits your problem.
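If your documents are already plain text, building such a CSV yourself can be as simple as the sketch below. This is not docs2csv, and it does no OCR or format conversion. The column names `id`, `title` and `text` are our assumption for the sake of the example, so check Overview's CSV import instructions for the exact columns it expects. The function name `folder_to_csv` is also made up here.

```python
import csv
from pathlib import Path


def folder_to_csv(folder, csv_path):
    """Write one CSV row per .txt file in `folder`; return the row count."""
    files = sorted(Path(folder).glob("*.txt"))
    # newline="" lets the csv module handle line endings and quoting itself,
    # so commas and line breaks inside a document stay in one cell.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "title", "text"])  # assumed column names
        for number, path in enumerate(files, 1):
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([number, path.name, text])
    return len(files)
```

Calling `folder_to_csv("my_documents", "overview.csv")` would produce a single file with one row per document, which you could then upload through the CSV import option. Both names in that call are placeholders.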