Comparing text to data by importing tags

Overview sorts documents into folders based on the topic of each document, as determined by analyzing every word in each document. But it can also be used to see how the document text relates to the date of publication, document type, or any other field related to each document.

This is possible because Overview can import tags. To use this feature, you will need to get your documents into a CSV file, which is a simple rows-and-columns spreadsheet format. As usual, the text of each document goes in the “text” column. But you can also add a “tags” column, which gives the tag or tags to be initially assigned to each document, separated by commas if there is more than one.
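
As a minimal sketch of what such a file looks like, the Python below writes a two-column CSV using the standard csv module. The function name write_documents_csv and the sample documents are made up for illustration; only the “text” and “tags” column names come from the description above.

```python
import csv
import io


def write_documents_csv(documents, out):
    """Write (text, tags) pairs to `out` as a CSV with "text" and "tags" columns."""
    writer = csv.writer(out)
    writer.writerow(["text", "tags"])
    for text, tags in documents:
        # Several tags share one cell, separated by commas. The csv module
        # quotes that cell so the commas are not read as column breaks.
        writer.writerow([text, ",".join(tags)])


# Made-up sample documents, for illustration only.
sample_documents = [
    ("Patrol found an IED by the road.", ["explosive hazard"]),
    ("Convoy took small arms fire.", ["enemy action", "convoy"]),
]

csv_buffer = io.StringIO()
write_documents_csv(sample_documents, csv_buffer)
print(csv_buffer.getvalue())
```

Note that a cell holding more than one tag contains commas, so it has to be quoted. The csv module handles this on its own, which is one reason to prefer it over joining strings by hand.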

To demonstrate, let’s look at a portion of the Afghanistan War Logs. The original CSV file has over 70,000 documents, each of which has many columns as described in the header row:

uid,date,type,category,tracking number,title,text,region,attack on, ...

Looking at the data, the “type” field takes on only a few different values, such as “enemy action” and “explosive hazard.” Let’s use Overview to see how the content of each report — the actual text — aligns with the report type.

To do this, I edited the first row in the CSV file to change the “type” field to a “tags” field:

uid,date,tags,category,tracking number,title,text,region,attack on, ...
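
I made this change by hand, but the same header edit could be scripted. Here is a rough sketch in Python, assuming the file parses cleanly with the standard csv module; the function name rename_column and the two-line sample are hypothetical stand-ins for the real file.

```python
import csv
import io


def rename_column(src, dst, old="type", new="tags"):
    """Copy a CSV from `src` to `dst`, renaming one column in the header row."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    # Only the header row changes; every data row is copied through untouched.
    writer.writerow([new if name == old else name for name in header])
    writer.writerows(reader)


# A tiny made-up stand-in for the real file, with fewer columns.
original_csv = io.StringIO(
    "uid,date,type,title,text\n"
    "1,2009-07-01,Enemy Action,Example title,Example report text\n"
)
edited_csv = io.StringIO()
rename_column(original_csv, edited_csv)
print(edited_csv.getvalue())
```

To run this on real files, pass file objects opened with newline="" in place of the StringIO buffers, as the csv module documentation recommends.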

Rather than trying to analyze several years’ worth of data at once, I also used a simple script to filter the rows by date, extracting the 3,078 documents from July 2009. (Overview currently has a limit of 50,000 documents per document set, and anyway it’s often useful to take specific subsets of big sets for close analysis.)
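
The script itself isn’t reproduced here, but a filter of this kind might look roughly like the sketch below. It assumes the “date” column holds strings that begin with the year and month, as in 2009-07-15; that format, the function name filter_by_month, and the sample rows are all assumptions for illustration.

```python
import csv
import io


def filter_by_month(src, dst, month_prefix, date_column="date"):
    """Copy only the rows whose date starts with `month_prefix`; return the count kept."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        # Assumes dates are written year first, e.g. "2009-07-15 12:30:00".
        if row[date_column].startswith(month_prefix):
            writer.writerow(row)
            kept += 1
    return kept


# Made-up sample rows, for illustration only.
all_reports = io.StringIO(
    "uid,date,tags,text\n"
    "1,2009-06-30 23:10:00,Enemy Action,Example report A\n"
    "2,2009-07-01 04:00:00,Explosive Hazard,Example report B\n"
    "3,2009-07-15 12:30:00,Enemy Action,Example report C\n"
    "4,2009-08-02 09:00:00,Friendly Action,Example report D\n"
)
july_reports = io.StringIO()
kept_count = filter_by_month(all_reports, july_reports, "2009-07")
print(kept_count, "rows kept")
```

If the dates in your own file are written differently, the startswith test would need to be replaced with proper date parsing.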

You can get this final edited file here. When it is loaded into Overview, the incident types automatically appear as tags.

Overview Afghan tags

Here I’ve selected the “Explosive Hazard” tag. You can see that most of the documents with this tag appear on the right side of the tree. But Overview doesn’t look at the tags when sorting documents into folders, only the text. So the fact that documents with the same tag end up near each other means there is a pattern in how the text of a report relates to its type. More precisely, there is a correlation between the “text” and “type” fields.

It’s pretty easy to understand why in this case. If you look at the folders on the right side of the tree, you’ll see they are labelled with words like “IED” and “found.” The authors of the reports used different language to describe incidents that involved an explosive device than they used for incidents that did not. Conversely, the documents tagged “Enemy Action” mostly end up on the left side of the tree. The other categories have much smaller numbers of documents and tend to appear grouped together in small folders much farther down the tree.

You can use these imported tags for several purposes:

  • to find where certain types of documents ended up in the tree
  • to determine what type of documents are in a particular folder of interest
  • to check that Overview is dividing your documents into meaningful folders
  • to see the relationships between text and data

This last use, looking for correlations between text and data, is a powerful possibility. For example, you could feed publication year into the “tags” field to analyze how document topics changed over the decades. Or you could use a “sex” tag to see if documents about men are different from documents about women. The possibilities are endless.
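
As one illustration of the publication-year idea, here is a small Python sketch that fills the “tags” column with the year taken from a date column. It assumes each date begins with a four-digit year; the function name add_year_tags, the column names other than “text” and “tags”, and the sample rows are hypothetical.

```python
import csv
import io


def add_year_tags(src, dst, date_column="date"):
    """Copy a CSV, setting each row's "tags" cell to the year from its date."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames)
    if "tags" not in fieldnames:
        fieldnames.append("tags")
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Assumes the date starts with a four-digit year, e.g. "1985-03-02".
        # Any existing value in "tags" is replaced by the year.
        row["tags"] = row[date_column][:4]
        writer.writerow(row)


# Made-up sample rows, for illustration only.
dated_documents = io.StringIO(
    "title,date,text\n"
    "Example A,1985-03-02,Example text A\n"
    "Example B,2009-07-15,Example text B\n"
)
year_tagged = io.StringIO()
add_year_tags(dated_documents, year_tagged)
print(year_tagged.getvalue())
```

For a collection spanning many decades, tagging by decade instead of by year would give fewer, larger groups to compare in the tree.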