
Netflix Memoirs: Using the Pearson Correlation Coefficient

Updated 2/12/07 to include SQL for computing movie prediction.

Slogging my way through the Netflix challenge, I've come across a number of computational techniques of which I was previously unaware. One such technique, using the Pearson correlation coefficient to find similar movie raters, seems useful in a wide variety of situations. What follows is a brief introduction to this calculation and its applicability in project work.

Imagine you're developing an intranet to be used by the entire company. The intranet will house forums, business documents, contacts, HR forms and plenty of articles from both internal and external sources. A common complaint from users of large intranets is that it is difficult to search through the sea of information to find useful content pertinent to the task at hand. Unless it's something HR related, it's often easier to ask Google than to look within an intranet. Part of what makes Google so great is that it usually returns highly relevant results. Your intranet would likely be more useful, and more used, if that same level of relevance were brought to the employee's experience.

Combining content ratings with the Pearson correlation coefficient, you can preemptively suggest content that the employee may find useful. For example, suppose an employee has rated a number of bridge-construction content items highly. As it happens, a number of bridges have been completed in the past and the intranet has a plethora of content related to the subject. The problem is having to scour all the forums, documents, articles, etc. to find other content that may be useful. By encouraging users to rate content, you can look for items that similar users have rated highly and proactively present the results to the user.

There are a number of techniques that may be employed to find suggested content; what's described here utilizes the fully-computed Pearson correlation coefficient. I say "fully-computed" because it means performing the calculation on every unique user combination. Although relatively accurate, this does not scale well to many users. If you have a lot of users, over a few thousand or so, you can augment the technique with cluster-based calculations. By clustering users into a few categories of similarity, you greatly decrease the number of users each user must be compared to. (The details of performing cluster-based calculations are beyond the scope of this post.)

**Using the Pearson Correlation Coefficient**

The basic idea of finding recommended content based on the Pearson correlation coefficient is as follows:

- Calculate the "similarity" between each pair of users by comparing how they have rated common content. If Frank has rated something 4/5 stars and Jane has also rated it 4/5 stars, then these users would be considered similar. These calculations are very time consuming, as this is essentially the "handshake" problem: the calculation has to be performed for each unique pair of users, and the number of unique pairs is n(n - 1) / 2. For the Netflix challenge, the number of unique pairs is 115,290,497,766...yes, that's 115 *billion*...that's a lot of handshaking! Even with just a few users, these calculations should be performed offline.
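As a quick sanity check on that figure, the handshake count for the roughly 480,189 users in the Netflix training data works out as follows (a small Python sketch):

```python
def unique_pairs(n):
    """Number of unique user pairs: the 'handshake' count n(n - 1) / 2."""
    return n * (n - 1) // 2

netflix_users = 480_189  # users in the Netflix Prize training set
print(unique_pairs(netflix_users))  # → 115290497766
```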

Similarity between users is calculated using the Pearson Correlation Coefficient. This calculation returns a value between -1 and 1. Two users with a similarity of 1 have rated every item identically. The formula is as follows:

P_{a,u} = Σ_{i=1..m} (r_{a,i} − r̄_a)(r_{u,i} − r̄_u) / sqrt( Σ_{i=1..m} (r_{a,i} − r̄_a)² · Σ_{i=1..m} (r_{u,i} − r̄_u)² )

P_{a,u} is the similarity between users *a* and *u*. *m* is the total number of items the two users have in common. r_{a,i} is the rating user *a* gave item *i*. r̄_a (r with a bar over it) is the average of all of user *a*'s ratings for all items.

A strength of this formula is that it takes each user's average rating into account. So if user *a* consistently rates items high while user *u* consistently rates them low, but their ratings rise and fall together from item to item, the two can still have a similarity of 1.
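The similarity formula above can be sketched in Python; here each user's ratings are assumed to be a dict mapping item id to rating (an illustrative sketch, not the post's SQL implementation):

```python
from math import sqrt

def pearson_similarity(ratings_a, ratings_u):
    """Pearson correlation between two users over their common items.

    ratings_a / ratings_u map item id -> rating; each user's average is
    taken over their entire rating history, as described in the post.
    """
    common = set(ratings_a) & set(ratings_u)
    if not common:
        return 0.0
    avg_a = sum(ratings_a.values()) / len(ratings_a)
    avg_u = sum(ratings_u.values()) / len(ratings_u)
    num = sum((ratings_a[i] - avg_a) * (ratings_u[i] - avg_u) for i in common)
    den = sqrt(sum((ratings_a[i] - avg_a) ** 2 for i in common)
               * sum((ratings_u[i] - avg_u) ** 2 for i in common))
    # Tiny epsilon avoids division by zero, mirroring the SQL later in the post
    return num / (den + 0.00001)
```

For example, two users whose ratings move in lockstep score near 1, while users whose ratings move in opposite directions score near -1.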

Now if two users only have a couple of items in common, their similarity should have less influence on the overall calculation. So...

- Weight the similarity between users to account for the number of items in common. If the number of items in common is below some threshold, multiply the similarity by the number of items in common divided by the threshold; otherwise, leave the similarity unchanged. Assume the threshold is set to 20 common items: although Frank and Jane have each rated many items, they've only rated 13 items in common, so their similarity score should be multiplied by 13/20. Conversely, Frank and Stein have rated 40 items in common, so their similarity score is left unchanged.
- To find a rating prediction for a particular user and a particular item, first select a number of users with the highest weighted similarity scores with respect to the current user, from among those who have rated the item in question.
- Finally, average the ratings given to the item by the similar users to predict what the user under review would rate it. Run this process for each item to find recommended content for each user. All of these calculations may be performed during off-peak hours to ensure a quick response time. The prediction for a particular item is calculated as follows:

p_{a,i} = r̄_a + ( Σ_{u=1..n} (r_{u,i} − r̄_u) · P_{a,u} ) / ( Σ_{u=1..n} |P_{a,u}| )

p_{a,i} is the predicted rating user *a* would give item *i*. *n* is the number of similar users being compared to. r_{u,i} is the rating user *u* gave the item. r̄_u (r with a bar over it) is the average of all of user *u*'s ratings for all items.
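Put together, the weighting and prediction steps can be sketched in Python. This is a minimal sketch under my own names: `neighbors` is assumed to already hold the most-similar users who rated the item, as tuples of (similarity, their rating of the item, their overall average rating):

```python
def weighted_similarity(similarity, items_in_common, threshold=20):
    """Damp similarity scores that are backed by few common items."""
    return similarity * min(items_in_common / threshold, 1.0)

def predict_rating(avg_a, neighbors):
    """Predict user a's rating of an item from its most-similar raters.

    avg_a: user a's overall average rating.
    neighbors: list of (similarity, rating, avg) tuples, where rating
    is r_{u,i} and avg is user u's overall average (r-bar with sub u).
    """
    denom = sum(abs(sim) for sim, _, _ in neighbors)
    if denom == 0:
        return avg_a  # no usable neighbors: fall back to a's average
    offset = sum(sim * (r - avg_u) for sim, r, avg_u in neighbors)
    return avg_a + offset / denom
```

With the Frank-and-Jane example above, `weighted_similarity(s, 13)` yields `s * 13/20`, while Frank and Stein's 40 common items leave their score unchanged.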

There are a few items which may be tweaked to hone the algorithm:

- *Minimum Items in Common*: the threshold used to deem pairs of users non-similar if they do not have a minimum number of items in common.
- *Similarity Threshold*: the threshold used to dismiss pairings that don't have at least a minimum similarity.
- *Similarity Weighting Threshold*: the threshold used to scale down similarity scores based on the number of items in common. E.g., if set to 50, only people with 50 or fewer items in common would have their similarity score multiplied by (number of items in common / 50).
- *Neighborhood Size*: the maximum number of users that will be considered for similarity when predicting the rating a user would give an item.
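For concreteness, the four knobs might be bundled up as follows; the field names are my own, and the defaults are the values used in the SQL later in the post (10, 50 and 30), with the similarity threshold left at zero:

```python
from dataclasses import dataclass

@dataclass
class TuningParameters:
    """Tunable knobs for the recommendation algorithm (illustrative defaults)."""
    min_items_in_common: int = 10          # below this, pairs are deemed non-similar
    similarity_threshold: float = 0.0      # pairings below this similarity are dismissed
    similarity_weighting_threshold: int = 50  # scale similarity for pairs under this many common items
    neighborhood_size: int = 30            # max similar users consulted per prediction
```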

For more fun, a genetic algorithm may be employed to find a close-to-optimal combination of these variables for your environment. Note that these variables may also need to change as more content, users and ratings are added.

**Applying the Pearson Correlation Coefficient to the Netflix Challenge**

If you've imported all the Netflix data into SQL Server, you can use the following SQL to calculate the Pearson correlation coefficient for every unique user combination. As each user is completed, a "global_stats" table is updated to keep track of that user; you can therefore stop the process at any time and restart it later, and it will pick up where it left off. There are only a few tables involved; the following lists the tables and their columns:

- *movies*: id
- *users*: id, average_rating
- *ratings*: user_fk, movie_fk, rating
- *user_relationships*: user_1_fk, user_2_fk, movies_in_common, similarity

And to populate the user_relationships table using the Pearson correlation coefficient...

```sql
DECLARE @userIdMin int
DECLARE @userId1 int

SET @userIdMin = (
    SELECT value
    FROM global_stats
    WHERE stat_name = 'last_user_analyzed'
)

DECLARE allUsers1 CURSOR FOR
    SELECT id
    FROM users
    WHERE id > @userIdMin
    ORDER BY id

OPEN allUsers1
FETCH NEXT FROM allUsers1 INTO @userId1

WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO user_relationships
    SELECT @userId1, r2.user_fk, COUNT(r1.user_fk),
        (SUM((r1.rating - u1.average_rating) *
             (r2.rating - u2.average_rating)))
        /
        (SQRT((SUM(SQUARE(r1.rating - u1.average_rating)))
            * (SUM(SQUARE(r2.rating - u2.average_rating))))
         -- Add a negligible amount to avoid division by zero
         + 0.00001)
        AS similarity
    FROM ratings r1, ratings r2, users u1, users u2
    WHERE r1.user_fk = @userId1
        AND r2.user_fk > @userId1
        AND r1.movie_fk = r2.movie_fk
        AND r1.user_fk = u1.id
        AND r2.user_fk = u2.id
    GROUP BY r1.user_fk, r2.user_fk
    -- Eliminate any comparisons with 9 or fewer movies in common
    HAVING COUNT(r1.user_fk) >= 10

    -- Clear the transaction log every 10 users
    IF @userId1 % 10 = 0
    BEGIN
        DUMP TRAN Netflix WITH NO_LOG
        DBCC SHRINKFILE (Netflix_log, 2) WITH NO_INFOMSGS
    END

    -- Keep a tab as to which user was just completed
    UPDATE global_stats
    SET value = @userId1
    WHERE stat_name = 'last_user_analyzed'

    FETCH NEXT FROM allUsers1 INTO @userId1
END

CLOSE allUsers1
DEALLOCATE allUsers1
```

The SQL for calculating the prediction looks relatively simple, based on the prediction calculation shown above, but it's actually a bit tricky, as you only want the similarity multiplied by (*n* / threshold) when the number of movies in common is less than the threshold. The SQL needed for making a movie prediction based on PCC calculations follows:

```sql
DECLARE @userToBePredictedId int
SET @userToBePredictedId = <INSERT USER ID HERE>
DECLARE @movieToBePredictedId smallint
SET @movieToBePredictedId = <INSERT MOVIE ID HERE>
DECLARE @neighborhoodSize tinyint
SET @neighborhoodSize = 30
DECLARE @commonMoviesWeightingThreshhold tinyint
SET @commonMoviesWeightingThreshhold = 50

SELECT TOP (@neighborhoodSize)
    user_1_fk, user_2_fk, movies_in_common, similarity,
    CAST(weighted_similarity AS decimal(5, 4))
FROM (
    SELECT ur.user_1_fk, user_2_fk, movies_in_common, similarity,
        -- Calculate weighted similarity based on # of movies-in-common
        (similarity *
            ((CAST(movies_in_common AS decimal(5, 0))
                / @commonMoviesWeightingThreshhold)
            -- Subtract everything > 1 if the value itself is > 1
            - (CAST(FLOOR(CAST(movies_in_common AS decimal(5, 0))
                    / @commonMoviesWeightingThreshhold) AS bit)
                * ((CAST(movies_in_common AS decimal(5, 0))
                    / @commonMoviesWeightingThreshhold) - 1))))
        AS weighted_similarity
    FROM user_relationships ur, ratings
    WHERE user_1_fk = @userToBePredictedId
        AND user_2_fk = ratings.user_fk
        AND ratings.movie_fk = @movieToBePredictedId
    UNION
    SELECT ur.user_1_fk, user_2_fk, movies_in_common, similarity,
        -- Calculate weighted similarity based on # of movies-in-common
        (similarity *
            ((CAST(movies_in_common AS decimal(5, 0))
                / @commonMoviesWeightingThreshhold)
            -- Subtract everything > 1 if the value itself is > 1
            - (CAST(FLOOR(CAST(movies_in_common AS decimal(5, 0))
                    / @commonMoviesWeightingThreshhold) AS bit)
                * ((CAST(movies_in_common AS decimal(5, 0))
                    / @commonMoviesWeightingThreshhold) - 1))))
        AS weighted_similarity
    FROM user_relationships ur, ratings
    WHERE user_2_fk = @userToBePredictedId
        AND user_1_fk = ratings.user_fk
        AND ratings.movie_fk = @movieToBePredictedId
) AS neighborhood
ORDER BY weighted_similarity DESC
```
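The FLOOR/CAST-to-bit gymnastics in that query are just a way of expressing min(movies_in_common / threshold, 1) in a single SQL expression, relying on T-SQL converting any nonzero value to a bit of 1. A small Python sketch (with hypothetical names) shows the two forms are equivalent:

```python
def sql_weighting(similarity, movies_in_common, threshold=50):
    """Python rendering of the SQL weighted-similarity expression."""
    ratio = movies_in_common / threshold
    bit = 1 if int(ratio) != 0 else 0   # mimics CAST(FLOOR(...) AS bit)
    return similarity * (ratio - bit * (ratio - 1))

def min_weighting(similarity, movies_in_common, threshold=50):
    """The same weighting, stated directly."""
    return similarity * min(movies_in_common / threshold, 1)
```

Below the threshold the bit is 0 and the similarity is scaled by the ratio; at or above it, the bit is 1 and the subtraction cancels the excess, leaving the similarity unchanged.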

As mentioned previously, a few of the variables found within the code could be optimized for the Netflix challenge, specifically, with the use of a genetic algorithm.

I'd love to hear if you've used these types of calculations in your own project work. Also, if you happen to come across a spare supercomputer for me to use to bust out a couple billion Pearson Correlation Coefficients for the Netflix challenge...please let me know!

Billy

Posted 11-07-2006 1:48 PM by Billy McCafferty

Filed under: SQL Server, Algorithms

