Images in this post missing? We recently lost them in a site migration. We're working to restore these as you read this. Should you need an image in an emergency, please contact us at imagehelp@codebetter.com
Netflix Memoirs: Top Contender Divulges Algorithm

The Netflix Challenge has been a maddeningly difficult problem since it's introduction a few months ago.  (It's certainly been a prominent cause of my frequent lack of sleep during this same time span.)  Ironically, the data itself has one of the simplest structures you can conceive of, consisting of a user id, a movie id, a rating and a date.  (The illusive simplicity of the data is the core reason for pulling your hair out if you work on it long enough.)

Throughout the challenge, Simon Funk has been one of the top names on the Netflix Leaderboard; as of today, he stands in the number 10 slot with a 5.37% improvement over Netflix' own proprietary recommendation system.  (The leader is currently at 6.13% improvement.)  Funk has been kind enough to divulge, in detail, his algorithm for getting amongst the top contenders:  http://sifter.org/~simon/journal/20061211.html.

Very surprisingly (to me anyway), Funk doesn't mine direct correlations between users to predict a rating.  (See my previous post on using the Pearson Correlation Coefficient for an overview of taking user correlations into account.)  Many challenge attempts so far have used grouping strategies to find users that have a high level of correlation between each other.  Since there are over 115 billion possible user correlations within the data, grouping strategies are a practical way to efficiently find promising user correlations.   (Neural networks can be leveraged to assist in finding the groupings; one overview of this approach can be found at http://www.cs.toronto.edu/~rsalakhu/papers/science.pdf.  Incidentally, the author of this paper is also in the top 10 of contenders.)  Instead of looking at user correlations, Funk leverages a learning algorithm which categorizes movies into n types of categories - 40 to be exact.  This makes it simpler to see which types of categories a particular user likes or dislikes.  Example categories include action, adventure, romance, etc.  But these categories aren't specified explicitly - they coalesce naturally while iteratively reducing the error resulting from the algorithm.  Imagine the movies each being rocks of various size - each size would correlate to a movie category.  Using a sieve with very small holes will separate the smallest rocks from the rest - this collection of small rocks becomes the first category of movies.  Increase the size of the holes a little bit and get the next category of movies.  Repeat 38 more times.  Even though the piles of rocks, or group of movies, can't be labeled descriptively, they've been categorized appropriately, nonetheless.  Obviously, this is a very abstract description of Funk's algorithm, but it makes it a bit simpler to visualize while reading the technique.  Even if you're not participating in the challenge, Funk's approach is a very interesting read involving machine learning, singular value decomposition, and statistical overfitting.  It's certainly not an easy read, but you're sure to learn something new with each pass.  A lenghthy discussion, along with some example implemtions with source-code, can be found at http://www.netflixprize.com/community/viewtopic.php?id=453.

On a related note, if this type of stuff interests you, and you enjoy the mathematical side of computing, I'd recommend (an attempt at) reading Concrete Mathematics by Graham, Knuth (the author of The Art of Computer Programming), and Patashnik.  Chapter 2 is particularly applicable to daily-programming life.  (I can proudly say that I made it through the first 1 1/2 chapters of the book before having to put it down to allow my brain to heal.  And don't let if fool you, the word "foundation" in the sub-title of the book is more attuned to describing the foundation of a skyscraper than that of a tree-house.)

I hope you find these offbeat, programmatic segways to be an interesting addition to the rest of the devlicio.us content.

Billy


Posted 01-02-2007 3:41 PM by Billy McCafferty
Filed under:

[Advertisement]

Comments

Evan wrote re: Netflix Memoirs: Top Contender Divulges Algorithm
on 01-02-2007 7:32 PM

thanks for this..the netflix challenge has been the object of my interest as well..i'll file this technique away and maybe use it for a future project :-)

Benrice wrote re: Netflix Memoirs: Top Contender Divulges Algorithm
on 01-03-2007 10:49 AM

It's a bit like the 20q.com 20 questions game.   The answers for it range from absolute Yes to absolute No with Probably, Sometimes, and Doubtful in the middle.  No matter the question asked, it gets sifted into one of those 5 heaps (with a couple heaps like Uknown and  Irrelevant thrown in).  Whether the question is wholly relevant "Is it bigger than a bread box." or completely out of the blue "Does it like Brad Pitt movies?" the question gets stacked into one of those heaps to aid in the decision.

About The CodeBetter.Com Blog Network
CodeBetter.Com FAQ

Our Mission

Advertisers should contact Brendan

Subscribe
Google Reader or Homepage

del.icio.us CodeBetter.com Latest Items
Add to My Yahoo!
Subscribe with Bloglines
Subscribe in NewsGator Online
Subscribe with myFeedster
Add to My AOL
Furl CodeBetter.com Latest Items
Subscribe in Rojo

Member Projects
DimeCasts.Net - Derik Whittaker

Friends of Devlicio.us
Red-Gate Tools For SQL and .NET

NDepend

SlickEdit
 
SmartInspect .NET Logging
NGEDIT: ViEmu and Codekana
LiteAccounting.Com
DevExpress
Fixx
NHibernate Profiler
Unfuddle
Balsamiq Mockups
Scrumy
JetBrains - ReSharper
Umbraco
NServiceBus
RavenDb
Web Sequence Diagrams
Ducksboard<-- NEW Friend!

 



Site Copyright © 2007 CodeBetter.Com
Content Copyright Individual Bloggers

 

Community Server (Commercial Edition)