This is part of our series on advice for building recommender systems. For additional recsys tips, check out this post on why product beats math.
To make content or product recommendations in industries where users don’t usually give explicit ratings to items, a recommender system needs to infer which items a user likes from implicit signals. These signals can include views, purchases, favorites lists and wishlists, shares on social networks, or even sentiment analysis on natural language.
It’s also important to estimate how significant each signal is. A purchase and a tweet are both pretty strong signals that a user likes an item; but if a user mentions “Sliced Bread” and “Nutella” together in a tweet, it’s more likely that Sliced Bread and Nutella reflect a similar taste than if the user had merely purchased both items (the user could favor peanut butter, say, and be buying the Nutella for someone else).
Turning all of this information into a single score for each user-item pair can be tricky, because any assignment of weights to signal types makes an assumption about user behavior. This post will give some tips on how to examine this problem from a statistical perspective, and link to some Pig code implementing the methods described.
A first step to take is to examine the relative frequency of each of the types of signals. Intuitively, if views are common but tweets are rare, then tweets ought to be more significant than views. Simply comparing the total occurrences of each signal type can be misleading though.
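As a baseline, here is what that simple comparison looks like in code: global inverse-frequency weights computed from an event log. The log format and the event data are hypothetical placeholders, and the next paragraph explains why this naive version can mislead.

```python
from collections import Counter

# Hypothetical log of (user, signal_type, item) events.
events = [
    ("u1", "view", "mead"), ("u1", "view", "bread"),
    ("u1", "purchase", "mead"), ("u2", "view", "mead"),
    ("u2", "tweet", "mead"), ("u2", "view", "nutella"),
]

counts = Counter(signal for _, signal, _ in events)
total = sum(counts.values())

# Naive inverse-frequency weights: rarer signal types count for more.
weights = {signal: total / n for signal, n in counts.items()}
print(weights)  # views are common, so they get the lowest weight
```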
Suppose you find that “on average” 1 in 250 user interactions for your service leads to a tweet. You can’t assume from this average alone that “most” users tweet 1 in 250 times. More likely, a few users are prolific tweeters, informing the world what they had for breakfast, while most tweet much less often.
This disparity can be characterized by an exponential distribution (blue curve below), which is often used to model frequency of events, or a Pareto distribution (red curve below), which is like the exponential distribution but “heavy-tailed”, meaning that outliers are a greater proportion of the population (note that it is only defined for values >= 1).
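A quick simulation makes the difference concrete. The sketch below draws from an exponential and a Pareto distribution with (approximately) the same mean of 250 and compares how much probability mass sits far out in the tail; the mean of 250 echoes the hypothetical 1-in-250 tweet rate above, and the Pareto shape parameter is chosen only to match that mean.

```python
import random

random.seed(42)
N = 100_000
mean = 250  # hypothetical average interactions per tweet

# Exponential with mean 250.
exp_samples = [random.expovariate(1 / mean) for _ in range(N)]

# Pareto on [1, inf) with alpha chosen so its mean is also 250
# (mean = alpha / (alpha - 1) for alpha > 1).
alpha = 250 / 249
pareto_samples = [random.paretovariate(alpha) for _ in range(N)]

def tail_fraction(samples, cutoff):
    """Fraction of samples beyond a cutoff."""
    return sum(s > cutoff for s in samples) / len(samples)

# The heavy Pareto tail carries far more mass: a few extreme
# users can dominate the overall average.
print(tail_fraction(exp_samples, 10 * mean))
print(tail_fraction(pareto_samples, 10 * mean))
```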
To account for the non-normal distribution of signal frequencies across users, you can assign values to signals on a per-user basis. This means that a tweet is more or less valuable depending on the individual user’s propensity for tweeting. On the other hand, if a user has only ever tweeted once, your model shouldn’t fixate on it or conclude that it was a life-changing event for them.
In order to balance these two concerns, instead of using the raw observed frequencies of each signal type for each user, you can use a “Bayesian estimate” of the frequencies, which takes into account both the observed frequencies for a user and also the global frequencies of each signal type across all users. It tends towards the global frequencies if there is little data for a user, or the user-specific frequencies if there is plenty of data. In other words, the model is “skeptical” of outlier results at first, but “changes its mind” as it sees more data.
If you are familiar with Bayesian analysis (if not, see this book for a great, programmer-friendly introduction), here are the ideas of the last three paragraphs in mathematical notation:
- weight(signal_type | user) = 1 / P(random_event == signal_type | user)
- P(random_event == signal_type | user) = E[X | observations], X ~ Bernoulli(p)
- Use a conjugate prior Beta(α, β) with empirically derived hyperparameters
- α = some statistic (e.g. median or 25th percentile) of the distribution of # signals of this type per user
- β = some statistic of the distribution of total # signals observed per user
- X | observations ~ Beta(α’, β’) (cf. Wikipedia’s table of posteriors for conjugate priors)
- α’ = α + [# observed signals of this type for this user]
- β’ = β + [# observed signals not of this type for this user]
- E[X | observations] = α’ / (α’ + β’)
- Substitute this value into the original equation for weight(signal_type | user)
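The steps above can be sketched in a few lines of Python. The hyperparameter values in the example calls (α = 2, β = 500) are hypothetical placeholders, not statistics derived from real data:

```python
def bayesian_weight(n_signal, n_total, alpha, beta):
    """Posterior-mean estimate of P(event == signal_type | user),
    inverted to give weight(signal_type | user).

    n_signal: # observed signals of this type for this user
    n_total:  total # signals observed for this user
    alpha, beta: Beta prior hyperparameters derived from the
                 global per-user distributions (e.g. medians)
    """
    alpha_post = alpha + n_signal                 # α’ = α + successes
    beta_post = beta + (n_total - n_signal)       # β’ = β + failures
    p = alpha_post / (alpha_post + beta_post)     # E[X | observations]
    return 1.0 / p                                # weight(signal_type | user)

# A new user with a single tweet stays close to the global prior...
w_new = bayesian_weight(n_signal=1, n_total=1, alpha=2, beta=500)
# ...while a prolific tweeter's weight reflects their own tweet rate.
w_prolific = bayesian_weight(n_signal=400, n_total=1000, alpha=2, beta=500)
```

Note how the model is “skeptical” at first (the single tweet barely moves the estimate off the prior) but “changes its mind” once 1,000 observations are available.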
Here is a sample implementation of this algorithm in Pig: https://gist.github.com/jspacker/6058763.
Please note that no automated process can substitute for real domain knowledge, and there is no “cookie-cutter” algorithm that will work out of the box for any recommender system. This is meant to serve as an inspiration and/or template on top of which domain-specific knowledge and feedback from product managers (with whom you should definitely stay in close communication!) can be applied.
Two immediate limitations of the algorithm are:
- A signal may be rare but also not significant: maybe it’s just not relevant to most users, or maybe it’s hidden behind a bad user-interface. The sample code adds an additional factor into the formula to address this.
- The algorithm does not consider temporal factors, such as how long a user has been active, or how long a type of signal has been observed. How to integrate this information is likely to be very domain- and company-specific.
Finally, there is one last small factor to consider. After choosing signal weightings for each user, you add up the signal values for each (user, item) pair. The recsys needs an appropriate scale on which to compare these aggregate values across different (user, item) pairs. In my experience, a scale that models “diminishing returns” works well. For example, if user “Hrothgar” has a score of 1 with item “mead”, and user “Beowulf” has a score of 2 with “mead”, then yes, Beowulf probably likes mead twice as much as Hrothgar. But if Hrothgar’s score is 100 and Beowulf’s is 200, it is more logical to say simply that mead is an important part of both of their lives: their love for it is beyond quantification. You can model diminishing returns with a logistic scaling function (illustrated below), which always returns a value between 0 and 1.
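A minimal sketch of such a scaling function, using the standard logistic curve; the midpoint and steepness parameters here are hypothetical tuning knobs you would fit to your own score distribution:

```python
import math

def logistic_scale(score, midpoint=5.0, steepness=1.0):
    """Map an unbounded aggregate score into (0, 1) with
    diminishing returns. midpoint is the score mapped to 0.5;
    steepness controls how quickly the curve saturates."""
    return 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))

# Low scores remain distinguishable...
print(logistic_scale(1), logistic_scale(2))
# ...while very high scores both saturate near 1.
print(logistic_scale(100), logistic_scale(200))
```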
We are left with (user, item, value) triples where each value is between 0 and 1, representing the degree to which the user likes that item.
Through our work above, we’ve used implicit signals to create a synthetic ratings matrix that properly accounts for both the relative significance of signals and their absolute numerical scale. In other words, leveraging Bayesian estimation and logistic scaling allows us to create a replacement for actual user ratings whenever a recommender system algorithm requires them.