I recently ran into a very cute visualization of the topics of XKCD comics. It’s made using a topic modeling algorithm where the computer automatically figures out what topics xkcd covers, and the relationships between them. I decided to compare this xkcd topic visualization to Overview, which does a similar sort of thing in a different way (here’s how Overview’s clustering works).
The first cluster I found was all the “hat guy” comics. Overview’s phrase detection created a first-level folder for “hat guy” and also threw “beret” in there. Nice, but there’s a lot of non-hat related stuff in that folder too. This other material splits out into its own node two levels down, and seems to be comics about “guys” or “boys” and “girls.” That’s a pretty wide topic as opposed to hat guy comics (it includes a guy-girl duo Christmas special). I removed the guy-girl folder from the hat tag, and the result is shown in green below.
It’s fun to see exactly what each folder contains, because aside from the imported text descriptions of each comic there is conveniently a URL column in the source CSV, which is where the “Open in new tab” link in the UI will take you (under the gear menu.)
Another large first level folder (143 docs) contains comics about “graphs” or “axes”, “chart,” “lines,” etc. This one is a pretty clean folder, in that almost everything in it is one of xkcd’s charts, visualizations, maps, etc. or there is some sort of labelled schematic that appears in one of the panels, like this one. Overview was even able to separate out different types of charts, such as this folder which is mostly bar charts.
Then I started looked though smaller, lower level folders. I quickly found a newscast folder. What’s interesting about this folder is that there is no one word in common between all the newscast comics. But these comics have enough overlap through terms like “news”, “anchor”, and “press” that they get grouped together anyway. I’ve went through each of the 15 docs (open the first doc in the folder, keep pressing next using either the arrow or the “j” key, untag when appropriate) to get an idea of how coherent or not this cluster is. 10 of the 15 are newscasts, as you can see from the orange tag highlight on the node in this image.
The screenshot also shows the programming folder to the right of the newscast folder (11 docs). Again, there is no one term that appears across all these docs. If there was Overview would label the node with “ALL: programmer” or something. Instead we get some “programmer” but also “code” and “algorithm” and “mobile.” Again Overview has succeeded in finding a concept even though there is disparate language.
Topic quality varies throughout the tree, with some tight, interpretable topics and also some large “miscelaneous” folders. Of course you can always type a word into the search field to see exactly where documents containing a particular word ended up in the tree. I put about 30 minutes into this and I’ve tagged about 400 of the 1300 documents. (I could finish the job by using the new show untagged feature.) So we might get a pretty complete picture of what’s available in the xkcd universe in about 2 hours total. Of course if you need high precision on the tags on individual documents we have to manually check them (select tag, then press “j” repeatedly to scan the docs quickly.) Assigning tags to folders in Overview tends to over-tag somewhat because there is often some miscellaneous stuff in a folder.
How does Overview compare to other topic modeling algorithms?
Many folks have heard of topic modeling algorithms, which are different from but related to Overview’s text analysis. Topic modeling works by automatically assigning one of a predefined number of “topics” to each word in each document, whilst simultaneously figuring out which words should belong to which topic. There are many different topic modeling algorithms but many are based on a technique called Latent Dirichlet Allocation (LDA.) You can get a feel for what LDA does by doing it yourself with pen and paper.
My exploration of xkcd was inspired by a recent LDA analysis of the web comic by Carson Sievert. Here’s what that looks like, as a visualization of the extracted topics and their words (click for larger):
Overview doesn’t derive “topics” directly. Instead it uses multi-level document clustering algorithms based on a standard technique called tf-idf cosine similarity. We do this because it’s simpler to implement, much faster to run on large document sets, and — we suspect — easier to interpret because each document gets placed in exactly one folder, whereas LDA assigns multiple topics per document. Arguably what Overview does is “topic modeling” since it tries to create topic-themed folders, but that name usually refers to LDA-type algorithms and I’ve been wondering for some time how Overview’s clustering compares.
The “topics” of an LDA analysis are really just distributions of words, where some words are very common in that topic (perhaps “fish” if the topic is “the ocean”) and others are more rare. LDA topics correspond roughly to Overview’s folders, so let’s see how they compare. I was able to find a few points of reference in the LDA visualization aboive. Topic #1 seems to be all charts. Topic 17 has “hat” and “guy”, though I don’t see “beret” in there. There are many uninterpretable “miscellaneous” topics and a lot of seemingly random words in the tail of each topic. However, these words might make more sense if we could see the source comics easily from the interactive. LDA has many tuning parameters and algorithmic variants, and it’s possible that it might work especially well for other document sets; it seems to do a nice job on the Sarah Palin emails.
We’ve run into the problem of diversity: Randall Munroe writes about a huge range of different things, as defined by words and phrases that only appear in one or two comics. Also, many of the comics are hard to model since they have little text or feature only relatively generic words like “guy” and “woman.” This is actually a very common situation for document sets (or other high-dimensional data) and LDA and Overview deal with this heterogeneity in different ways. LDA seems to start “modeling the noise” by adding unrelated words to the words-in-a-topic distributions, while Overview ends up generating really miscellaneous folders that don’t resolve into a clear conceptual whole until several levels down the tree, or sometimes not at all.
Ultimately I don’t think the choice of text analysis algorithm is all that important, as long as you have one that works reasonably well. Topic modeling and document clustering are mathematically related anyway. The real trick in document mining is building a system that people can actually understand, trust, and use, as a recent paper from Stanford’s visualization lab makes wonderfully clear. Flexible document import, clear visualizations, rapid tagging, integrated search, easy document viewing — text mining is about much more than algorithms. Still, we are always exploring new types of analysis and visualization for Overview, so it’s fun to see how different techniques compare.