Overviewproject.org can now import a document set from a CSV file. CSV is a standard, simple format for tabular data that many programs can read and write. For example, Excel can save a spreadsheet as a CSV file. This lets you use Overview without uploading your documents to DocumentCloud, and makes it much easier to import data from sources such as Twitter.
There are two basic steps to loading a CSV file into Overview: get your data in the correct format, and upload it. You may also need to extract text from a set of PDFs.
Getting your data into the correct format
A CSV file is simply a list of “comma-separated values,” organized into rows and columns, like a spreadsheet or a table. The file starts with a list of the column names, separated by commas. This is followed by each row of data, one row per line, with the values for each column again separated by commas. Overview requires only one column, which must be named “text.” Here is an example file:
text
This is the content of the first document.
And here is the text of document the second
Document three talks about quick brown foxes.
.
.
If the text of a document spans multiple lines, or itself contains commas, then it needs to be quoted. Quotes inside a quoted document must be “escaped” by doubling them, so that each " becomes "". This is all standard CSV stuff, and any program or library that writes CSVs should do it automatically.
text
"This document is really long and crosses multiple lines
and contains commas, which is why it is quoted."
"This is the second document. I'd like to say ""Hi!"" to everyone
to show how to put quotes inside a quoted document.
The text of this document can cross as many lines as needed,
or even contain blank lines like this:

The second document ends with this final quote."
The third document fits all on one line so no quotes needed.
"The fourth document has a comma in it, so it's quoted too."
.
.
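If you are generating the file from a program rather than by hand, a CSV library will apply these quoting rules for you. The following is a minimal sketch using Python's standard csv module; the sample documents and the function name documents_to_csv are made up for illustration.

```python
import csv
import io

# Hypothetical sample documents: plain, with a comma, and with quotes plus a line break.
documents = [
    "This document is short and needs no quoting.",
    "This one has a comma, so the writer quotes it.",
    'This one says "Hi!" and\nspans two lines.',
]


def documents_to_csv(docs):
    """Return CSV text with a single "text" column, one row per document."""
    buffer = io.StringIO()
    # The default quoting mode only quotes fields that need it
    # (commas, quotes, line breaks) and doubles any embedded quotes.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text"])
    for doc in docs:
        writer.writerow([doc])
    return buffer.getvalue()


csv_text = documents_to_csv(documents)
print(csv_text)
```

Reading the result back with csv.reader should return the original documents unchanged, which is a quick way to confirm the quoting worked.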
And that’s it. Overview will display the text in the viewer pane when you click on each document. If you want to display something else for the document, you can add a “url” column which tells Overview to load a particular web page when you view that document. For security reasons, this must be an https URL. Here’s an example using tweets:
text,url
"New deploy today -- cleaner clustering, better handling of larger document sets. Anyone got a pile of PDFs they want to look at? Try it!",https://twitter.com/overviewproject/status/281075194557259777
"""“I’m not going to sit out on the newsroom floor and sort pages into stacks of documents"" ~@jackgillum on need for document mining software.",https://twitter.com/overviewproject/status/264450385928929280
.
.
It’s also possible to add a unique ID column, simply named “id”. Overview will read this value and associate it with the document, and it is how documents will be referenced when you export tags (coming soon).
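Before uploading, it can help to check a file against the column rules described above. The sketch below is a local pre-check only, not Overview's own validation: the function name check_overview_csv is hypothetical, and treating an empty url as acceptable is an assumption.

```python
import csv
import io


def check_overview_csv(csv_text):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []

    # "text" is the only column the article says is required.
    if "text" not in columns:
        problems.append('missing required "text" column')

    seen_ids = set()
    # row_number counts data rows (the header is not counted).
    for row_number, row in enumerate(reader, start=1):
        url = row.get("url") or ""
        # Assumption: an empty url is allowed; a non-empty one must be https.
        if url and not url.startswith("https://"):
            problems.append(f"row {row_number}: url is not https")
        if "id" in columns:
            doc_id = row.get("id")
            if doc_id in seen_ids:
                problems.append(f"row {row_number}: duplicate id {doc_id!r}")
            seen_ids.add(doc_id)
    return problems
```

For example, calling check_overview_csv on the contents of your file and printing the returned list shows which rows to fix before you upload.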
Uploading your CSV file to Overview
First, select the upload option from the main document set list page:
Then choose a file. Overview will show a preview and do some basic checks to ensure that the format is OK. It should look like this:
You can also tell Overview what character encoding the file uses. Try changing this if you see funny square characters in the preview, or accents aren’t displaying right. Then hit upload, and away we go. You can use Overview as usual on the document set.
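If you are not sure which encoding your file uses, you can try decoding it locally first. This is a rough sketch: the candidate list (UTF-8, then Windows-1252) is an assumption, not the list of encodings Overview offers, and the function name decode_with_fallback is made up.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Example: bytes saved by a program that used Windows-1252.
text, used = decode_with_fallback("café".encode("cp1252"))
print(used, text)
```

A successful decode does not prove the guess is right, since Windows-1252 accepts almost any byte sequence, so still check the accented characters in Overview's preview.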
Creating a CSV for Overview from a collection of PDFs
Overview does not currently offer an integrated way to view a collection of PDFs without DocumentCloud. However, there is a workaround, based on a tool from the prototype version. You will need some familiarity with the command line to do this. First install Git and download the prototype, then use the loadpdf script to extract the text from a folder full of PDF files and create a CSV suitable for uploading into Overview. This process is described in the documentation for the prototype.
Unfortunately, you will not be able to view the original PDFs within Overview unless you put them on a web server somewhere and then modify the URL column to point to the location of each document. We’re working on it.
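If you do host the PDFs yourself, filling in the URL column can be scripted. The sketch below is not the loadpdf script and makes several assumptions: the text has already been extracted into one UTF-8 .txt file per PDF, each PDF has the same base name as its text file, and all PDFs sit under a single https base URL. The function names and the example URL are hypothetical.

```python
import csv
from pathlib import Path
from urllib.parse import quote


def build_rows(text_dir, base_url):
    """Pair each extracted .txt file with the URL of its hosted PDF."""
    rows = []
    for text_path in sorted(Path(text_dir).glob("*.txt")):
        # Assumption: report.txt was extracted from report.pdf.
        pdf_name = text_path.stem + ".pdf"
        rows.append({
            "text": text_path.read_text(encoding="utf-8"),
            # quote() percent-encodes spaces and other unsafe characters.
            "url": base_url.rstrip("/") + "/" + quote(pdf_name),
        })
    return rows


def write_overview_csv(rows, csv_path):
    """Write rows to a CSV file with "text" and "url" columns."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["text", "url"])
        writer.writeheader()
        writer.writerows(rows)
```

For example, write_overview_csv(build_rows("extracted_text", "https://example.org/pdfs"), "documents.csv") would produce a file ready to upload, with each row pointing at the hosted copy of its PDF.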