Cluster (with Clojure)

I was watching a video of Berkeley professor Michael Jordan lecturing on the Chinese Restaurant process and for a moment he showed a slide of a tree of documents that were matched up by word frequencies.  It seemed cool so I coded up my own version of it, mostly to learn about the topic and to get some practice with Clojure.  It turned out there’s a name for this:  hierarchical clustering.  I went with the ‘agglomerative’ version of it, which is repeatedly pairing up things and pairs of those things until you have a single pairing that, beneath it, contains everything you started with.  Usually you choose pairings based on similarity - you pair up the two available things that are most similar. 

To make documents into something you can easily compare, I'm converting each into a hashtable of the relative frequencies of the words it contains, like this:  {"purple" 0.0015, "it" 0.0023, "this" 0.0083  ...}  etc.  The code finds the two most similar docs based on those frequencies and matches them up, making a new hashtable of the same form by averaging the frequencies of those two docs [1].  This "pairing" is now on the same footing as all the other documents.  So we again find the two most similar docs (allowing this new pairing to be treated as a doc), and repeat that process until we're left with only a single pairing.  It contains every pairing we made, and ultimately every document we started with.  This is our finished product, a hierarchical cluster.
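In Clojure, a stripped-down sketch of that pipeline might look like the following.  The names here (word-freqs, merge-docs, cluster) and the word-splitting regex are just illustrations for this post, not the actual code in my repo:

    ;; A minimal sketch of the steps described above.
    (ns cluster-sketch
      (:require [clojure.string :as str]))

    (defn word-freqs
      "Map of word -> relative frequency for one document's text."
      [text]
      (let [words (re-seq #"[a-z]+" (str/lower-case text))
            n     (count words)]
        (into {} (for [[w c] (frequencies words)]
                   [w (/ c (double n))]))))

    (defn euclidean-distance
      "Distance between two frequency maps, treating a missing word as frequency 0."
      [a b]
      (Math/sqrt
       (reduce + (for [w (into (set (keys a)) (keys b))]
                   (let [d (- (get a w 0.0) (get b w 0.0))]
                     (* d d))))))

    (defn merge-docs
      "Average two frequency maps into the frequency map of their pairing."
      [a b]
      (let [zeros-a (zipmap (keys a) (repeat 0.0))
            zeros-b (zipmap (keys b) (repeat 0.0))]
        (merge-with (fn [x y] (/ (+ x y) 2.0))
                    (merge zeros-b a)
                    (merge zeros-a b))))

    (defn closest-pair
      "The two items (docs or pairings) whose frequency maps are nearest."
      [items]
      (let [v (vec items)]
        (apply min-key
               (fn [[a b]] (euclidean-distance (:freqs a) (:freqs b)))
               (for [i (range (count v))
                     j (range (inc i) (count v))]
                 [(v i) (v j)]))))

    (defn cluster
      "Greedy agglomerative clustering: repeatedly replace the closest pair
       with a node containing both, until a single node remains."
      [docs]
      (loop [items (set docs)]
        (if (= 1 (count items))
          (first items)
          (let [[a b] (closest-pair items)
                node  {:children [a b]
                       :freqs    (merge-docs (:freqs a) (:freqs b))}]
            (recur (-> items (disj a b) (conj node)))))))

Documents go in as maps like {:name "lear" :freqs (word-freqs text)}, and what comes out is a nested tree of :children nodes, i.e. the hierarchical cluster.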

You can see the code on my GitHub.

I turned it loose on some text files and got the following (plotted with the Protovis JavaScript toolkit) [2]:

[Image: an agglomerative hierarchical cluster of the documents, rendered by Protovis.]

The blue nodes are the documents and the green nodes are pairings.  From top to bottom, the docs are: 

-the German Wikipedia article on the lambda calculus,
-the first several hundred words of a German novel, Wilhelm Meister’s Apprenticeship by Johann Wolfgang von Goethe,
-an excerpt from Shakespeare's King Lear,
-the scifi short story, “They’re Made out of Meat”,
-the English Wikipedia article on the lambda calculus,
-the English Wikipedia article on “The Buzzer” or UVB-76,
-the Sherlock Holmes story, “The Red Headed League”,
-the Sherlock Holmes story, “A Scandal in Bohemia”,
-the scifi short story, “The Long Watch” by Robert Heinlein,
-Shakespeare’s first and second sonnets,
-Shakespeare’s third and fourth sonnets,
-the Spanish poems "Candor" and "Reto" by Julio Flórez,
-the Spanish Wikipedia article on the lambda calculus,
-the Spanish Wikipedia article on functional programming,
-the Dutch Wikipedia article on the lambda calculus,

Notice that each pairing shows three words.  Those are the most 'interesting' words: the ones whose frequencies do the most to make that pairing stand out from the average.  For example, if you've got a document that uses the word "mooloolaba" a few times, that's probably going to be one of its interesting words because it's so rare elsewhere.  But a word can also be interesting for not showing up, e.g. if the word "the" never appears in a document, or appears only a few times in a long text.  In that case the word is shown in parens.
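If you're curious how that labeling could be done, here's a rough sketch.  I'm assuming a corpus-avg map holding each word's average frequency across all the docs, and a fixed cutoff of n words, so treat this as an illustration rather than the exact logic in the repo:

    ;; Sketch: pick the n words whose frequency in this pairing deviates most
    ;; from the corpus-wide average.  A word that is interesting for being
    ;; rarer than average gets wrapped in parens.
    (defn interesting-words
      [corpus-avg freqs n]
      (->> (into (set (keys corpus-avg)) (keys freqs))
           (map (fn [w] [w (- (get freqs w 0.0) (get corpus-avg w 0.0))]))
           (sort-by (fn [[_ delta]] (- (Math/abs delta))))
           (take n)
           (map (fn [[w delta]] (if (neg? delta) (str "(" w ")") w)))))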

It seems to have done an OK job here.  It strongly leans toward matching up docs (and pairings) in the same language when possible.  I was hoping that it would be able to pull out the two science fiction stories, but that isn't happening.  It's not smart enough for that.  I'm pleased that it grouped the Spanish functional programming and lambda calculus articles before including the Spanish poetry.  It put Shakespeare's sonnets together, but failed to associate them very closely with the excerpt from King Lear. [3]

I was also hoping the Dutch and German articles would cluster together before being joined with the Spanish or English docs, but that didn't happen.  It might be that "de" is a common word in both Dutch and Spanish, while "is" and "in" are common in both Dutch and English.  So the classifier may see Dutch documents as partway between English and Spanish ones (even though the opposite is closer to the truth). [4]

Interesting.  I think this could be improved by using a statistical distance between the frequency vectors, instead of the unscaled Euclidean distance over raw relative frequencies that I'm using now.
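For example (and this is only a guess at what such a distance could look like), each word's squared difference could be divided by that word's frequency variance across the corpus, so that words whose frequencies naturally swing a lot don't dominate:

    ;; Sketch of a variance-scaled distance.  word-variances is assumed to be a
    ;; map of word -> variance of that word's frequency across the corpus; the
    ;; small floor avoids dividing by zero.
    (defn standardized-distance
      [word-variances a b]
      (Math/sqrt
       (reduce + (for [w (into (set (keys a)) (keys b))]
                   (let [d (- (get a w 0.0) (get b w 0.0))
                         v (max (get word-variances w 0.0) 1e-9)]
                     (/ (* d d) v))))))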

In playing with this code I stumbled on an interesting way to refactor it.  I'll talk about that in the next post.






[1] This is called the vector space model: each word is a dimension, and each document is a point, or vector, in that word-space.  Identifying docs by the words they contain, without worrying about word order, is in general called using the "bag of words".  I'm comparing those frequencies using Euclidean distance.  It's often said to be better to use the cosine of the angle between the two vectors; the rankings would be identical if the vectors had unit Euclidean length, and mine are instead normalized so their frequencies sum to 1 (is there a name for such a vector?), but since I'm only looking for a ranking of distances, plain Euclidean distance seems close enough here.
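For comparison, cosine similarity over the same kind of frequency maps would look roughly like this (again just a sketch, not something the posted code uses):

    ;; Sketch: cosine of the angle between two frequency maps.
    (defn dot-product [a b]
      (reduce + (for [[w x] a] (* x (get b w 0.0)))))

    (defn cosine-similarity [a b]
      (/ (dot-product a b)
         (* (Math/sqrt (dot-product a a))
            (Math/sqrt (dot-product b b)))))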

[2] I omitted the thirty lines of code to translate the s-expression result to the JSON needed for Protovis.

[3] I found in trial runs over all of Shakespeare’s sonnets that it did a pretty good job of sorting out the earlier sonnets from the later ones.

[4] The regular expression I used for extracting words is poor for non-English languages, but the algorithm can probably handle it anyway, as the fragments it creates will be unique to the words they came from.

In the Bay Area and need help with your machine learning project? Contact me at info@reatlas.com or twitter.com/herdrick  
