The Netflix Challenge has been a maddeningly difficult problem since it's introduction a few months ago. (It's certainly been a prominent cause of my frequent lack of sleep during this same time span.) Ironically, the data itself has one of the simplest structures you can conceive of, consisting of a user id, a movie id, a rating and a date. (The illusive simplicity of the data is the core reason for pulling your hair out if you work on it long enough.)
Throughout the challenge, Simon Funk has been one of the top names on the Netflix Leaderboard; as of today, he stands in the number 10 slot with a 5.37% improvement over Netflix' own proprietary recommendation system. (The leader is currently at 6.13% improvement.) Funk has been kind enough to divulge, in detail, his algorithm for getting amongst the top contenders: http://sifter.org/~simon/journal/20061211.html.
Very surprisingly (to me anyway), Funk doesn't mine direct correlations between users to predict a rating. (See my previous post on using the Pearson Correlation Coefficient for an overview of taking user correlations into account.) Many challenge attempts so far have used grouping strategies to find users that have a high level of correlation between each other. Since there are over 115 billion possible user correlations within the data, grouping strategies are a practical way to efficiently find promising user correlations. (Neural networks can be leveraged to assist in finding the groupings; one overview of this approach can be found at http://www.cs.toronto.edu/~rsalakhu/papers/science.pdf. Incidentally, the author of this paper is also in the top 10 of contenders.) Instead of looking at user correlations, Funk leverages a learning algorithm which categorizes movies into n types of categories - 40 to be exact. This makes it simpler to see which types of categories a particular user likes or dislikes. Example categories include action, adventure, romance, etc. But these categories aren't specified explicitly - they coalesce naturally while iteratively reducing the error resulting from the algorithm. Imagine the movies each being rocks of various size - each size would correlate to a movie category. Using a sieve with very small holes will separate the smallest rocks from the rest - this collection of small rocks becomes the first category of movies. Increase the size of the holes a little bit and get the next category of movies. Repeat 38 more times. Even though the piles of rocks, or group of movies, can't be labeled descriptively, they've been categorized appropriately, nonetheless. Obviously, this is a very abstract description of Funk's algorithm, but it makes it a bit simpler to visualize while reading the technique. Even if you're not participating in the challenge, Funk's approach is a very interesting read involving machine learning, singular value decomposition, and statistical overfitting. It's certainly not an easy read, but you're sure to learn something new with each pass. A lenghthy discussion, along with some example implemtions with source-code, can be found at http://www.netflixprize.com/community/viewtopic.php?id=453.
On a related note, if this type of stuff interests you, and you enjoy the mathematical side of computing, I'd recommend (an attempt at) reading Concrete Mathematics by Graham, Knuth (the author of The Art of Computer Programming), and Patashnik. Chapter 2 is particularly applicable to daily-programming life. (I can proudly say that I made it through the first 1 1/2 chapters of the book before having to put it down to allow my brain to heal. And don't let if fool you, the word "foundation" in the sub-title of the book is more attuned to describing the foundation of a skyscraper than that of a tree-house.)
I hope you find these offbeat, programmatic segways to be an interesting addition to the rest of the devlicio.us content.
Billy
Posted
01-02-2007 3:41 PM
by
Billy McCafferty