Professor Sebastian Thrun quits Stanford to teach people

The text of his homepage today:
One of the most amazing things I’ve ever done in my life is to teach a class to 160,000 students. In the Fall of 2011, Peter Norvig and I decided to offer our class “Introduction to Artificial Intelligence” to the world online, free of charge. We spent endless nights recording ourselves on video, and interacting with tens of thousands of students. Volunteer students translated some of our classes into over 40 languages; and in the end we graduated over 23,000 students from 190 countries. In fact, Peter and I taught more students AI, than all AI professors in the world combined. This one class had more educational impact than my entire career. Just watch this video.
Now that I saw the true power of education, there is no turning back. It’s like a drug. I won’t be able to teach 200 students again, in a conventional classroom setting. I’ve just peeked through a window into an entire new world, and I am determined to get there.
(and yes, I gave up my tenured position at Stanford)
I could not be more impressed with Sebastian.  From a story about this:

… the physical class at Stanford… dwindled from 200 students to 30 students because the online course was more intimate and better at teaching…”  

Kill hashtables, get shorter code

Driving late at night two months ago I wondered, why do we return hashtables from functions?  Isn’t a hashtable like a function?  Instead of calling some function to get a hashtable, then looking up values on that hashtable with keys, why not simply call the function directly each time I want a value?  Just make the ‘key’ the last argument of the new function.  So I tried it on the code from my last post, where I explained hierarchical document clustering and showed some Clojure that does it.  In this post I’ll show how I eliminated hashtables in that code and got a shorter codebase, comparing the already pretty short original to the new code.  

The best way to show the change will be to walk you through what happens in a call to euclidean, which calculates the euclidean distance of two documents or an arbitrarily deep tree of documents, based on their word frequencies.    

In the original, ordinary version, I get a hashtable of the word frequencies of a tree of documents with a call to freq, which recursively combines the frequencies (hashtables) of the branches of the tree, ultimately calling freq-files on each leaf, which is a document, of the tree. freq-files returns a hashtable of frequencies for that a document.  We will pass the hashtable we get back from freq into euclidean.

In contrast, in this new version, we just call euclidean before getting any frequencies. euclidean calls freq once for each word in the corpus.  Each call to freq calculates the frequency for just that one word, again by recursively combining the frequency of that word of each of the branches, and ultimately calling freq-files on each document in the tree, which calculates just that word’s frequency for that document.  

This is a little weird.  Do you see what has happened here?  That data now is no longer represented with a hashtable - instead it’s not represented at all.

Doing all those function calls sounds grossly inefficient but it shouldn’t.  For one thing, who cares?  It’s just performance.  We can always profile and deal with problems like that later.  But second, it isn’t really any less efficient than getting the whole document worth of frequencies all at once, because now I’m memoizing - caching function return values.  I’ve wrapped all of the functions in that chain of calls down to and including freq-files (and beyond) in memoize [1].  For example, it looks like I’m reading files from disk every time I call freq-files.  But memoizing to-words prevents that, so following a first call, every time freq-files is called with the same file tree (but probably a different word) the memozied to-words returns a cached word list. frequencies-m (the memoized version of frequencies), is in turn called with that word list as its argument, a call which would be somewhat computationally costly but since it has a cached value associated with that argument it simply returns that.  This frequency needs to be relative, so we’ve got to divide by the word count of the doc.  I’ve got count wrapped in memoize too, calling it count-m, which I call on the results of calling to-words again, which again returns a cached value.  

The really cool thing is that that data is still there.  It’s just that you don’t see it or manipulate it in the code anymore.  It’s implicit.  

This change chops the code down by about 20%, from 55 non-comment lines of code to 44 (or 39, not counting the boilerplate function memoization, which would be a 30% reduction) [2]. 

This style of programming has downsides.  For me, wrapping functions in memoize means my editor / IDE can’t tell what arguments my functions take any more (a common problem when you have first class functions).  That sort of sucks.  Further, it makes some functions, like euclidean, hopelessly non-general, since euclidean now has to know what function to call to get a frequency [3].  It makes it harder to track down bugs.  When you get all the values you want from each function in a chain of functions, always passing the entire hashtable of them up the chain, it’s easier to figure out where a problem has come up than if you ask about a single value all the way down the call chain.  But the main disadvantage to this technique is that it doesn’t match the way I code.  I like to program very interactively.  For example, the first thing I did when I started coding this stuff up was to slurp up the documents in question and call frequencies on them.  With the resulting hashtables in hand I wrote code to find the distance between two document “vectors”.  With these distances I wrote the code to find the shortest ones, and so forth.  Calling the entire function chain to get each value is only possible after you’ve written that chain of functions.  While you’re building them it’s easier to pass all the data from each step to the next step.  But I haven’t really tried coding in this style from scratch yet, so who knows?  Maybe it’s got some other hidden advantage I don’t know about. 

But it’s hard to argue with brevity.  And by the way, there’s nothing inherently lisp-ish about this technique.  You can do it in any language with first class functions, like Python or Ruby, just as easily.

Comment on this post at News.YC.

In the Bay Area and need help with your machine learning project? Contact me at info@reatlas.com or twitter.com/herdrick  




[1]  And we’ve been doing some of that all along anyway, since we often call freq with the same arguments, over multiple calls to euclidean.

[2]  There’s actually another related change here that isn’t so interesting, but definitely helped.  In the original version, the word-list - the list of all words that appear somewhere in the documents - is calculated immediately upon calling cluster, and passed into each subsequent recursive call to cluster.  From there it’s passed into best-pairing, which passes it into euclidean, the only place where it is actually used. I did this because getting this list requires some file reading and computation and it seems ineffiecient to recalculate it every time I call euclidean, which is a lot. But that was the wrong way to think about this problem. When I went ahead and did the inefficent thing, the right fix became clear: memoization, again.  best-pairing’s argument is a list of file-trees in which every file in our corpus is represented, so we can calculate our list of words from that - so I did.  Now every call to euclidean makes a call to word-list.  But this isn’t such a problem because word-list and most of the functions it calls are memoized.  So the only additional cost of calling that function are several memo lookups - negligible - and flattening and sorting a file tree.  Which isn’t nothing but I don’t think it’s a big deal.  I should profile it to find out.

[3] Actually I could make euclidean more general by just sticking with the original version.  Because Clojure hashtables can be called as functions, I can just call euclidean like this: (euclidean (partial freq pof1) (partial freq pof2) (word-list pofs)), using closures created with partial function application as the frequency arguments.  Seemed a little harder to explain though, so I skipped it.

Cluster (with Clojure)

I was watching a video of Berkeley professor Michael Jordan lecturing on the Chinese Restaurant process and for a moment he showed a slide of a tree of documents that were matched up by word frequencies.  It seemed cool so I coded up my own version of it, mostly to learn about the topic and to get some practice with Clojure.  It turned out there’s a name for this:  hierarchical clustering.  I went with the ‘agglomerative’ version of it, which is repeatedly pairing up things and pairs of those things until you have a single pairing that, beneath it, contains everything you started with.  Usually you choose pairings based on similarity - you pair up the two available things that are most similar. 

To make documents into something you can easily compare, I’m converting each into a hashtable of the relative frequencies of the words it contains, like this:  {"purple" 0.0015, "it" 0.0023, "this" 0.0083  ...}  etc.  The code finds the two most similar docs based on those frequencies, and matches them up, making a new hashtable like that by averaging the frequencies of those two docs [1]. This “pairing” is now on the same footing with all the other documents.  So we again find the two most similar docs (allowing this new pairing to be treated as a doc), repeating that process until we’re left with only a single pairing.  It contains every pairing we made, and ultimately every document we started with.  This is our finished product, a hierarchical cluster.

You can see the code on my GitHub.

I turned it loose on some text files got the following (plotted with the Protoviz Javascript toolkit) [3]

Image of an agglomerative hierarchical cluster of documents, rendered by Protovis.

The blue nodes are the documents and the green nodes are pairings.  From top to bottom, the docs are: 

-the German Wikipedia article on the lambda calculus,
-the first several hundred words of a German novel, Wilhelm Meister’s Apprenticeship by Johann Wolfgang von Goethe,
-an exerpt from Shakespeare’s King Lear,
-the scifi short story, “They’re Made out of Meat”,
-the English Wikipedia article on the lambda calculus,
-the English Wikipedia article on “The Buzzer” or UVB-76,
-the Sherlock Holmes story, “The Red Headed League”,
-the Sherlock Holmes story, “A Scandal in Bohemia”,
-the scifi short story, “The Long Watch” by Robert Heinlein,
-Shakespeare’s first and second sonnets,
-Shakespeare’s third and fourth sonnets,
-the Spanish poems, “Candor” and “Reto” from Julio Flórez,
-the Spanish Wikipedia article on the lambda calculus,
-the Spanish Wikipedia article on functional programming,
-the Dutch Wikipedia article on the lambda calculus,

Notice that each pairing shows three words.  Those are the most ‘interesting words’: those whose frequencies do the most to make that pairing stand out from the average.  For example, if you’ve got a document that uses the word “mooloolaba” a few times, that’s probably going to be one of its interesting words because it’s so rare elsewhere.  But a word could also be interesting for not showing up, ex. if the word “the” never shows up in a document or only a few times in a long text.  In that case the word is in parens.

It seems to have done an OK job here.  It strongly leans toward matching up docs (and pairings) in the same language when possible.  I was hoping that it would be able to pull out the two science fiction stories, but that isn’t happening.  It’s not smart enough for that.  I’m pleased that it grouped the Spanish functional programming and lambda calculus articles before inluding the Spanish poetry.  It put Shakespeare’s sonnets together, but failed to associate them very closely with the exerpt from King Lear.  [4] 

I was also hoping the Dutch and German articles would clusted together before being joined with the Spanish or English docs, but this didn’t happen.  It might be that “de” is a common word in Dutch and Spanish, whereas “is” and “in” are common in Dutch and English.  So the classifier might see Dutch documents as partway between English and Spanish ones (even though the opposite is closer to the truth).  

Interesting.  I think this could be improved by using a statistical distance to measure the distance between vectors, instead of unscaled distance of relative frequencies as I’m doing now.

In playing with this code I stumbled on a interesting way to refactor it.  I’ll talk about that in the next post.






[1] This is called the vector space model, declaring each word to be a dimension, and each document a point, or vector, in that word-space.  Identifying docs by the words they contain and not worrying about word order is in general called using the “bag of words”.  I’m comparing those frequencies using Euclidean distance.  It’s often said to be better to use the cosine of the two vectors but that doesn’t matter here since the dimensions of any document vector sum to 1 (is there a name for such a vector?), and I’m only looking for a ranking of distance.

[2] I omitted the thirty lines of code to translate the s-expression result to the JSON needed for Protovis.

[3] I found in trial runs over all of Shakespeare’s sonnets that it did a pretty good job of sorting out the earlier sonnets from the later ones.

[4] The regular expression I used for extracting words is poor for non-English languages, but the algorithim can probably handle it anyway, as the fragments it creates will be unique to the words they came from.

 

In the Bay Area and need help with your machine learning project? Contact me at info@reatlas.com or twitter.com/herdrick  

Math question

Question: Can an edge of a hypergraph connect to some node more than once in some definition of a graph?  If so you could represent a text, for example a book, as a single directed hyperedge running through a graph of words.  EDIT:  Probably not.  If you could you’d lose lots of the useful properties of graphs, I think.

My other blog is a Posterous

My non-programming blog is here: http://herdrick.posterous.com/