Images in this post missing? We recently lost them in a site migration. We're working to restore these as you read this. Should you need an image in an emergency, please contact us at imagehelp@codebetter.com
Netflix Memoirs: Using the Pearson Correlation Coefficient

Updated 2/12/07 to include SQL for computing movie prediction.


Slugging my way through the Netflix challenge, I've come across a number of computational techniques of which I was previously unaware. One such technique, using the Pearson correlation coefficient to find similar movie raters, seems useful in a wide variety of situations. What follows is a brief introduction to this calculation and its applicability in project work.

Imagine you're developing an intranet to be used by the entire company. The intranet will house forums, business documents, contacts, HR forms and plenty of articles from both internal and external sources. A common complaint from users of large intranets is that it is difficult to search through the sea of information to find useful content, pertinent to the task at hand. Unless it's something HR related, it's often easier to ask Google than to look within an intranet. One of the strengths that makes Google so great is the fact that it usually returns very applicable results. Your intranet would likely be more useful, and more used, if this same level of applicability could be applied to the employee's experience.

Combining content-ratings with the Pearson correlation coefficient, you can preemptively suggest content that the employee may find useful. For example, suppose an employee has rated a number of bridge-construction content items highly. Incidentally, a number of bridges have been completed in the past and the intranet has a plethora of content related to the subject.  The problem is having to scour all the forums, documents, articles, etc to find other content that may be useful. By encouraging users to rate content, you can look for items that similar users have rated highly and proactively present the results to the user.

There are a number of techniques that may be employed to find suggested content; what's described here utilizes the fully-computed Pearson correlation coefficient. I mention "fully-computed" as it suggests performing the calculation on every unique, user combination.  Although relatively accurate, it does not scale well for many users. If you have a lot of users, over a few thousand or so, you can augment the technique with cluster-based calculations. By clustering users into a few categories of similarity, it greatly decreases the number of users to compare to. (The details of performing cluster-based calculations are beyond the scope of this post.)

Using the Pearson Correlation Coefficient

The basic idea of finding recommended content based on the Pearson correlation coefficient is as follows:

  1. Calculate the "similarity" between each user by comparing how each user has rated common content. If Frank has rated something 4/5 stars and Jane has also rated it 4/5 stars, then these users would be considered similar. These calculations are very time consuming as it essentially becomes the "handshake" problem. I.e. the calculation has to be performed for each unique combination of users. The number of unique combinations is: n (n - 1) / 2. For the Netflix challenge, the number of unique combinations is 115,290,497,766...yes that's 115 billion...that's a lot of handshaking! Even with just a few users, these calculations should be performed offline.

    Similarity between users is calculated using the Pearson Correlation Coefficient. This calculation returns a value between -1 and 1. Two users with a similarity of 1 have rated every item identically. The formula is as follows:


    Pa, u is the similarity between users a and u. m is the total number of items in common. ra, i is the rating user a gave an item. ra with the bar over it is the average of all user a's ratings for all items.

    A strength of this formula is that it takes into account average ratings for each user. So if user a rates everything a 5 and user u rates everything a 1, then they still have a similarity rating of 1.

    Now if users only have a couple of items in common, then their similarity should have less of an influence on the overall calculation. So...
  2. Weight the similarity between users to account for the number of items in common. If the number of items in common is below some threshold, then the similarity should be multiplied by the number of items in common and then divided by the threshold; otherwise, the similarity should be left unchanged. Assume the threshold is set to 20 common items; although Frank and Jane have rated on many items, they've only rated on 13 items in common. So their similarity score should be multiplied by 13/20. Conversely, Frank and Stein have rated 40 items in common, so their similarity score should be left unchanged.
  3. To find a rating prediction for a particular user for a particular item, first select a number of users with the highest, weighted similarity scores with respect to the current user that have rated on the item in question.
  4. Finally, average the ratings given to the item by the similar users to predict what the user under review would rate it. Run this process for each item to find recommended content for each user. All of these calculations may be performed during off-peak hours to ensure a quick response time. The prediction for a particular item is calculated as follows:


    pa, i is the predicted rating user a would give item i. n is the number of similar users being compared to. ru, i is the rating user u gave the item. ru with the bar over it is the average of all user u's ratings for all items.

There are a few items which may be tweaked to hone the algorithm:

  • Minimum Items in Common: The threshold to deem pairs of users non-similar if they do not have a minimum number of items in common.
  • Similarity Threshold: The threshold to dismiss pairings if they don't have at least a minimum similarity.
  • Similarity Weighting Threshold: The threshold to decrease similarity calculations based on the number of items in common. E.g. if set to 50, then only people with 50 or fewer items in common would have their similarity score multiplied by (number of items in common / 50).
  • Neighborhood Size: The most number of people that will be considered for similarity when predicting the rating a user would give an item.

For more fun, a genetic algorithm my be employed to find a close-to-optimal combination of these variables for your environment. Note that these variables may also change as more content, users and ratings are added.

Applying the Pearson Correlation Coefficient to the Netflix Challenge

If you've imported all the Netflix data into SQL Server, you can use the following SQL to calculate the Pearson Correlation Coefficient for every unique user combination. As each user is completed, a "global_stats" table is updated to keep track of that user. Therefore, you can stop the process at anytime and restart it at a later time and it will pick up where it left off. There are only a few tables involved; the following lists the tables and their columns:

  • movies: id
  • users: id, average_rating
  • ratings: user_fk, movie_fk, rating
  • user_relationships: user_1_fk, user_2_fk, movies_in_common, similarity

And to populate the user_relationships table using the Pearson correlation coefficient...

DECLARE @userIdMin int

DECLARE @userId1 int

 

SET @userIdMin = (

    SELECT value

    FROM global_stats

    WHERE stat_name = 'last_user_analyzed'

)

 

DECLARE allUsers1 CURSOR FOR

    SELECT id

    FROM users

    WHERE id > @userIdMin

    ORDER BY id

 

OPEN allUsers1

 

FETCH NEXT FROM allUsers1 INTO @userId1

WHILE @@FETCH_STATUS = 0

BEGIN

    INSERT INTO user_relationships

    SELECT @userId1, r2.user_fk, COUNT(r1.user_fk),

        (SUM((r1.rating - u1.average_rating) *

        (r2.rating - u2.average_rating)))

        /

        (SQRT((SUM(SQUARE(r1.rating - u1.average_rating)))

        * (SUM(SQUARE(r2.rating - u2.average_rating))))

        -- Add a negligible amount to avoid division by zero

        + 0.00001)

        AS similarity

    FROM ratings r1, ratings r2, users u1, users u2

    WHERE r1.user_fk = @userId1

    AND r2.user_fk > @userId1

    AND r1.movie_fk = r2.movie_fk

    AND r1.user_fk = u1.id

    AND r2.user_fk = u2.id

    GROUP BY r1.user_fk, r2.user_fk

    -- Eliminate any comparisons

    -- with 9 or fewer movies in common

    HAVING COUNT(r1.user_fk) >= 10

 

    -- Clear the transaction log every 10 users

    IF @userId1 % 10 = 0

    BEGIN

        DUMP TRAN Netflix WITH NO_LOG

        DBCC SHRINKFILE (Netflix_log, 2) WITH NO_INFOMSGS

    END

 

    -- Keep a tab as to which user was just completed

    UPDATE global_stats

    SET value = @userId1

    WHERE stat_name = 'last_user_analyzed'

 

    FETCH NEXT FROM allUsers1 INTO @userId1

END

 

CLOSE allUsers1

DEALLOCATE allUsers1

The SQL for calculating the prediction looks relatively simple, based on the prediction calculation shown above, but it's actually a bit tricky as you only want the similarity multiplied by (n / threshold) if the number of movies in common is less than the threshold. The SQL needed for making a movie predication based on PCC calculations follows:

DECLARE @userToBePredictedId int
SET @userToBePredictedId = <INSERT USER ID HERE>
DECLARE @movieToBePredictedId smallint
SET @movieToBePredictedId = <INSERT MOVIE ID HERE>
DECLARE @neighborhoodSize tinyint
SET @neighborhoodSize = 30
DECLARE @commonMoviesWeightingThreshhold tinyint
SET @commonMoviesWeightingThreshhold = 50
SELECT TOP (@neighborhoodSize) user_1_fk, user_2_fk, movies_in_common, similarity, CAST(weighted_similarity AS decimal(5, 4))
FROM (
  SELECT ur.user_1_fk, user_2_fk, movies_in_common, similarity, 
    -- Calculate weighted similarity based on # of movies-in-common
    (similarity * 
     ((CAST(movies_in_common AS decimal(5, 0)) / @commonMoviesWeightingThreshhold) 
      -- Subtract everything > 1 if the value itself is > 1
      - (CAST(FLOOR(CAST(movies_in_common AS decimal(5, 0)) / @commonMoviesWeightingThreshhold) AS bit)
         * ((CAST(movies_in_common AS decimal(5, 0)) / @commonMoviesWeightingThreshhold) - 1)))) as weighted_similarity
  FROM user_relationships ur, ratings
  WHERE user_1_fk = @userToBePredictedId
  AND user_2_fk = ratings.user_fk
  AND ratings.movie_fk = @movieToBePredictedId
  UNION
  SELECT ur.user_1_fk, user_2_fk, movies_in_common, similarity,
    -- Calculate weighted similarity based on # of movies-in-common
    (similarity * ((CAST(movies_in_common AS decimal(5, 0)) / @commonMoviesWeightingThreshhold) 
     -- Subtract everything > 1 if the value itself is > 1
     - (CAST(FLOOR(CAST(movies_in_common AS decimal(5, 0)) / @commonMoviesWeightingThreshhold) AS bit)
     * ((CAST(movies_in_common AS decimal(5, 0)) / @commonMoviesWeightingThreshhold) - 1)))) as weighted_similarity
  FROM user_relationships ur, ratings
  WHERE user_2_fk = @userToBePredictedId
  AND user_1_fk = ratings.user_fk
  AND ratings.movie_fk = @movieToBePredictedId
) AS neighborhood
ORDER BY weighted_similarity DESC

As mentioned previously, a few of the variables found within the code could be optimized for the Netflix challenge, specifically, with the use of a genetic algorithm.

I'd love to hear if you've used these types of calculations in your own project work. Also, if you happen to come across a spare supercomputer for me to use to bust out a couple billion Pearson Correlation Coefficients for the Netflix challenge...please let me know!

Billy


Posted 11-07-2006 1:48 PM by Billy McCafferty
Filed under: ,

[Advertisement]

Comments

Pierres Service » Blog Archive » netflix memoirs: using the pearson correlation coefficient wrote Pierres Service &raquo; Blog Archive &raquo; netflix memoirs: using the pearson correlation coefficient
on 11-29-2006 4:19 AM

About The CodeBetter.Com Blog Network
CodeBetter.Com FAQ

Our Mission

Advertisers should contact Brendan

Subscribe
Google Reader or Homepage

del.icio.us CodeBetter.com Latest Items
Add to My Yahoo!
Subscribe with Bloglines
Subscribe in NewsGator Online
Subscribe with myFeedster
Add to My AOL
Furl CodeBetter.com Latest Items
Subscribe in Rojo

Member Projects
DimeCasts.Net - Derik Whittaker

Friends of Devlicio.us
Red-Gate Tools For SQL and .NET

NDepend

SlickEdit
 
SmartInspect .NET Logging
NGEDIT: ViEmu and Codekana
LiteAccounting.Com
DevExpress
Fixx
NHibernate Profiler
Unfuddle
Balsamiq Mockups
Scrumy
JetBrains - ReSharper
Umbraco
NServiceBus
RavenDb
Web Sequence Diagrams
Ducksboard<-- NEW Friend!

 



Site Copyright © 2007 CodeBetter.Com
Content Copyright Individual Bloggers

 

Community Server (Commercial Edition)