Getting Started with the Overview Prototype

Note: these instructions apply only to the original prototype version, which you probably don’t want to use. Try the public server instead.

You can be up and running with the Overview prototype, browsing through the sample document sets, in just a few minutes.

Getting ready
First you will need Git to download the program and sample files. If you’re not used to Git, this might be a bit of a pain now, as opposed to just a straightforward download. But because Overview is a prototype under active development, we’re constantly fixing bugs and adding features, and Git lets you download and install new versions in one easy step. Git can be downloaded here for Windows and here for Mac.

You will also need Ruby to run Overview. For Windows, you can download an installer here. Be sure to download the latest version (Ruby 1.9.3) because Ruby 1.8.7 won’t work. On the Mac, you’re in luck because Ruby comes pre-installed.

Overview also needs Java. Windows comes with Java preinstalled these days. Some versions of OS X come with Java too, and the ones that don’t will ask whether you want to install Java the first time you try to run Overview. Say yes. However, there have been some reports that Overview won’t run with the Java version that comes with OS X 10.5 and below, so 10.6 or above is recommended.

If you’re on Linux, you’re probably already comfortable with standard development tools, so I’ll just say that you need to get Git, Ruby 1.9.3, and Java going. You will also need to replace visualization/lib/swt.jar with the appropriate version for your operating system.
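
On Mac or Linux, a small shell check can confirm that the three prerequisites are in place before you go further. This is only a sketch: have_tool is a helper name made up for this example, and it tests only that each command is on your PATH, not that Overview will run with that particular version.

```shell
#!/bin/sh
# Report which of Overview's prerequisites are on the PATH.

# have_tool NAME: succeeds if NAME resolves to a command.
# (Hypothetical helper, not part of Overview.)
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

for tool in git ruby java; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

# Ruby 1.8.7 will not work, so print the version for a manual check.
if have_tool ruby; then
  ruby -e 'puts "ruby version: " + RUBY_VERSION'
fi
```

If the Ruby line reports 1.8.7, install 1.9.3 before continuing.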

Installing the prototype
Now that you have Git installed, you can download the Overview program files by entering the following command in Terminal (Mac) or the command prompt (Windows):

 git clone

Congratulations! You now have Overview on your computer. You’ll probably want some sample data files to get started.

Loading the sample files
Before you load your own documents, you probably want to get one of the sample files loaded. Get them from GitHub:

 git clone

You now have three sample datasets: 1,500 press releases from nj-senator-menendez, 4,500 OCR’d pages of iraq-contractor-incidents, and the caracas-cables, about 7,000 Wikileaks cables which originate from or mention the city of Caracas. Each of these datasets is a single CSV input file containing all the text of all of the documents, plus another file containing some tags we created earlier (which you can view with the “load” button in the interface).

Viewing a dataset is currently a two-step process. First, you have to do some natural language preprocessing. This takes a few minutes, but only has to be done once. Then you can start the GUI. Starting from the directory where you ran the git clone commands above, you can load up Senator Menendez’s press releases like so:


Windows:

  cd overview-sample-files
  ..\overview-prototype\preprocess.bat nj-senator-menendez
  ..\overview-prototype\overview.bat nj-senator-menendez


Mac/Linux:

  cd overview-sample-files
  ../overview-prototype/ nj-senator-menendez
  ../overview-prototype/ nj-senator-menendez

Or substitute iraq-contractor-incidents or caracas-cables if you’d like to view those document sets. Again, you only need to run the preprocess script once; you can start right up with the overview script every time thereafter.
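
Because preprocessing only has to happen once per dataset, you may want a wrapper that remembers whether it has already run. The following is a minimal sketch for Mac or Linux, not part of Overview: run_once and the marker file are hypothetical names invented here, and the marker is purely our own bookkeeping, since this guide does not say which files the preprocess script produces.

```shell
#!/bin/sh
# run_once MARKER COMMAND [ARGS...]
# Runs COMMAND only if MARKER does not exist yet, and creates MARKER
# after COMMAND succeeds. MARKER is our own bookkeeping file, not
# something Overview reads or writes.
run_once() {
  marker=$1
  shift
  if [ -e "$marker" ]; then
    echo "skipping, already done: $marker"
    return 0
  fi
  # Only record success; a failed run leaves no marker behind.
  "$@" && : > "$marker"
}
```

Calling run_once with a marker name such as nj-senator-menendez.done, followed by the preprocess command and its argument, would run the preprocessing the first time and skip it on later calls.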

Using your own documents
There are three ways you can use your own documents: you can visualize PDF or TXT files on your local machine, upload the documents to DocumentCloud, or import the documents from a CSV file.

Suppose you have a directory called documents-dir full of PDF and/or TXT files. You can load it into Overview like this:


Windows:

  overview-prototype\loadpdf.bat documents-dir mydocs
  overview-prototype\overview.bat mydocs


Mac/Linux:

  overview-prototype/ documents-dir mydocs
  overview-prototype/ mydocs

Overview will scan documents-dir and all subdirectories for PDF and TXT documents and generate all the files that it needs to run the visualization. This only needs to be done once. Thereafter, you can start the visualization with the overview command whenever you like.
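
Before loading a large directory, it can help to preview which files a recursive scan like this would find. Here is a rough sketch for Mac or Linux using the standard find command. The function names are made up for this example, and the case-insensitive match is an assumption: Overview’s own matching rules may differ.

```shell
#!/bin/sh
# list_docs DIR: print every .pdf or .txt file under DIR, recursively.
# -iname matches regardless of case; whether Overview itself treats
# "REPORT.PDF" the same way is an assumption.
list_docs() {
  find "$1" -type f \( -iname '*.pdf' -o -iname '*.txt' \)
}

# count_docs DIR: print how many such files there are.
count_docs() {
  list_docs "$1" | wc -l | tr -d ' '
}
```

Running count_docs documents-dir gives a quick sanity check on how many documents to expect in the visualization.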

If you would like to upload that directory to DocumentCloud first, so that you can annotate, share, and eventually publish some or all of the documents, do this:


Windows:

  overview-prototype\dcupload.bat documents-dir mypassword mydocs
  overview-prototype\preprocess.bat mydocs
  overview-prototype\overview.bat mydocs

Mac/Linux:

  overview-prototype/ documents-dir mypassword mydocs
  overview-prototype/ mydocs
  overview-prototype/ mydocs

You have to replace “mylogin” and “mypassword” with your DocumentCloud username and password, of course. Overview will scan all subfolders of documents-dir for PDF and TXT files and upload them to DocumentCloud, simultaneously creating an input CSV file for the preprocess script that links to the newly uploaded documents. Again, you only need to run dcupload and preprocess once.

If you have documents or text in some other format and you’re handy with CSV files, you can also create your own input files for Overview directly. The format is quite simple, documented here.

You should also check out the video introduction to Overview. We are putting together more training materials, as well as building a few tools to help get your documents into the right format, such as a script that will read all the text or HTML files in a directory and glue them together into the CSV that Overview likes. We’re also always fixing bugs. To get the very latest version (for instance, after we tweet about a bug fix), just open up a command prompt in the overview-prototype directory and type

 git pull

Finally, this is just a prototype. Our next step will be to integrate Overview with DocumentCloud, and make it run right in the browser.

