Using Collaborative Filtering to find people who share tastes, and for making automatic recommendations based on things that other people like.

## Problem

Finding items of interest to user, based on those of other users of similar taste.

## Solution

A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours.

It looks at other things they like and combines them to create a ranked list of suggestions.

The underlying assumption is that if a person *A* has the same opinion as a person *B* on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.

## Breakdown

### Collecting Preferences

The dataset we are working with is a text file with the following headers

**# Day, Action, UserID, UserName, ArticleID, ArticleName**

We need to transform that into a matrix of users and their preferences.

@Override public Map<Integer, Map<Integer, Double>> getUsersArticlesRatings(List<UserAction> userActions) { //group by users Map<Integer, List<UserAction>> usersActions = userActions .stream() .collect(groupingBy(UserAction::getUserId)); //group articles by id, calculate sum of ratings Map<Integer, Map<Integer, Double>> usersArticlesRatings = new HashMap<>(usersActions.size()); for (Map.Entry<Integer, List<UserAction>> entry : usersActions.entrySet()) { final Map<Integer, Double> articlesRatings = entry .getValue() .stream() .collect(groupingBy(UserAction::getArticleId, ratingCalculator.getCollector())); usersArticlesRatings.put(entry.getKey(), articlesRatings); } return usersArticlesRatings; }

User actions can either be `View`

, `DownVote`

, `UpVote`

or `Download`

.

No matter how preferences are expressed, we need a way to map them onto numerical values.

public class RatingCalculatorImpl implements RatingCalculator { private static Map<Action, Double> ACTION_WEIGHTS; static { ACTION_WEIGHTS = new HashMap<>(); ACTION_WEIGHTS.put(Action.View, 1d); ACTION_WEIGHTS.put(Action.UpVote, 1d); ACTION_WEIGHTS.put(Action.DownVote, -1d); ACTION_WEIGHTS.put(Action.Download, 2d); } @Override public Collector<UserAction, ?, Double> getCollector() { return summingDouble(value -> ACTION_WEIGHTS.get(value.getAction())); } }

### Finding Similar Users

After collecting data about the things people like, we need a way to determine how similar people are in their tastes. We do this by comparing each person with every other person and calculating a *similarity score*.

We will use Pearson correlation coefficient, it is a measure of the linear correlation between two variables *X* and *Y,* or how well two sets of data fit on a straight line.

**user 1**and

**user 141**make a straight line, correlation is

**1.0**

public class SimilarityCalculatorPearsonCorrelation implements SimilarityCalculator { @Override public double calculateScore(Map<Integer, Double> firstUserPreferences, Map<Integer, Double> secondUserPreferences) { List<Integer> commonArticles = getCommonArticles(firstUserPreferences.keySet(), secondUserPreferences.keySet()); if (commonArticles.isEmpty()) { return 0; } double sum1 = 0, sum2 = 0; double sumSq1 = 0, sumSq2 = 0; double sumProduct = 0; for (Integer articleId : commonArticles) { final Double user1Rating = firstUserPreferences.get(articleId); final Double user2Rating = secondUserPreferences.get(articleId); sum1 += user1Rating; sum2 += user2Rating; sumSq1 += Math.pow(user1Rating, 2); sumSq2 += Math.pow(user2Rating, 2); final double product = user1Rating * user2Rating; sumProduct += product; } // Calculate Pearson score int n = commonArticles.size(); double num = sumProduct - (sum1 * sum2 / n); double den = Math.sqrt((sumSq1 - Math.pow(sum1, 2) / n) * (sumSq2 - Math.pow(sum2, 2) / n)); if (den == 0) { return 0; } return num / den; } private List<Integer> getCommonArticles(Set<Integer> firstUserPreferences, Set<Integer> secondUserPreferences) { List<Integer> commonArticles = new ArrayList<>(); for (Integer entry : firstUserPreferences) { if (secondUserPreferences.contains(entry)) { commonArticles.add(entry); } } return commonArticles; } }

Now we can use the similarity score to caculate other users' taste compared to a given user.

private List<Recommendation> createScoreMatrix(int userId, Map<Integer, Map<Integer, Double>> matrix) { Map<Integer, Double> userPreferences = matrix.get(userId) == null ? new HashMap<>() : matrix.get(userId); List<Recommendation> recommendations = new ArrayList<>(matrix.size()); for (Map.Entry<Integer, Map<Integer, Double>> entry : matrix.entrySet()) { final Integer otherUserId = entry.getKey(); if (otherUserId == userId) { continue; } final Map<Integer, Double> otherUserPreferences = matrix.get(otherUserId); final double sim = similarityCalculator.calculateScore(userPreferences, otherUserPreferences); if (sim > 0) { // get articles not viewed by userId final Map<Integer, Double> preferences = otherUserPreferences .entrySet() .stream() .filter(e -> !userPreferences.containsKey(e.getKey())) .collect(Collectors.toMap(p -> p.getKey(), p -> p.getValue())); final Recommendation recommendation = new Recommendation(otherUserId, sim, preferences); recommendations.add(recommendation); } } return recommendations; }

### Recommending Items

Using the similarity score, one might be tempted to rank top n users with similar taste and look for articles he hasn't viewed yet, but such an approach could accidentally turn up reviewers who haven’t reviewed some of the articles that he might like.

We need to score the articles by producing a weighted score that ranks the users. Take the votes of all the other users and multiply how similar they are to a given user by the score they gave each article.

This table shows correlation scores for each user and the ratings they gave the three articles (760, 9, and 514) that user 1 hasn't rated. The columns beginning with S.x give the similarity multiplied by the rating, so a person who is similar to user 1 will contribute more to the overall score than a person who is different. The Total row shows the sum of all these numbers.

We could just use the totals to calculate the rankings, but then an article reviewed by more users would have a big advantage. To correct for this, we need to divide by the sum of all the similarities for critics that reviewed that movie (the `Sim. Sum`

row in the table).

private Map<Integer, WeightedScore> createWeightedScore(List<Recommendation> scoreMatrix) { Map<Integer, WeightedScore> totalRatings = new HashMap<>(); scoreMatrix .forEach(recommendation -> recommendation .getOtherUserPreferences() .forEach((articleId, rating) -> { if (!totalRatings.containsKey(articleId)) { totalRatings.put(articleId, new WeightedScore(0d, 0d)); } final WeightedScore weightedScore = totalRatings.get(articleId); final double total = weightedScore.getRatingSum() + rating; weightedScore.setRatingSum(total); if (recommendation.otherUserPreferences.containsKey(articleId)) { weightedScore.setSimSum(weightedScore.getSimSum() + recommendation.getSimilarity()); } totalRatings.put(articleId, weightedScore); })); return totalRatings; }

Now user 1 recommendations are:

Article Id, Rating 875:5.0 48:4.0 182:4.0 2427:4.0 483:4.0

We get a list of recommended articles & a predicted score of how user 1 will rate them.