GitHub is an amazing platform for collaborating on open-source projects, but it can be hard to find a project that’s relevant to your specific interests, especially one worth contributing to. That’s why we’ve built an app that recommends repos based on your individual history on GitHub. (UPDATE 8/20/13: We’ve also added a Chrome Extension that lets you get recommendations without leaving GitHub.) You can also get recommendations similar to a specific repo you’re interested in. This post gives an overview of how we built the recommender system that powers the app.
At Mortar, we use Apache Pig and the Mortar Development Framework to make algorithms easy to develop and easy to scale. So easy, in fact, that performance was never a concern when building the recommender: I developed the algorithm in Apache Pig against a subset of the data on my laptop using Mortar’s local development mode, and then used Mortar to run it on a 10-node EC2 cluster when it came time to process the whole dataset. The job wrote recommendations for each GitHub user to a flat file in S3, which we then imported into DynamoDB, the database that backs the web app.
The overall idea behind the recommender algorithm is simple:
- Build a graph where edges represent interactions between users and repos
- Use that graph to find similarities between repos
- Recommend to you repos similar to those you’ve interacted with
Step 1 takes the raw GitHub event logs from the GitHub Archive and generates a graph from them. We extract four signals of interest: user watches a repo, user forks a repo, user sends a pull request to a repo, and user pushes to a repo. We then aggregate all of a user’s interactions with any given repo and scale the result so that it becomes a single number between 0 and 1 representing how engaged the user is with that repo. The “user-repo affinity” graph is the collection of all of these links.
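To make Step 1 concrete, here is a minimal Python sketch of the aggregation (the real job is written in Pig). The event type names are the ones GitHub uses for these four signals, but the weights in `EVENT_WEIGHTS` and the `total / (total + 1)` squashing are assumptions made for illustration; this post does not state the actual values or scaling function.

```python
from collections import defaultdict

# Hypothetical weights per event type; the real values are not given in this post.
EVENT_WEIGHTS = {
    "WatchEvent": 1.0,
    "ForkEvent": 2.0,
    "PullRequestEvent": 3.0,
    "PushEvent": 3.0,
}


def build_affinity_graph(events):
    """Turn (user, repo, event_type) tuples into {(user, repo): affinity}."""
    raw = defaultdict(float)
    for user, repo, event_type in events:
        weight = EVENT_WEIGHTS.get(event_type)
        if weight is None:
            continue  # skip event types outside the four signals
        raw[(user, repo)] += weight
    # Assumed scaling: total / (total + 1) maps any positive total into (0, 1).
    return {edge: total / (total + 1.0) for edge, total in raw.items()}


events = [
    ("alice", "linkedin/datafu", "WatchEvent"),
    ("alice", "linkedin/datafu", "ForkEvent"),
    ("bob", "linkedin/datafu", "PushEvent"),
]
graph = build_affinity_graph(events)
```

Any scaling that grows with the amount of interaction and stays between 0 and 1 would fit the description above; the one here was chosen only because it is short.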
Step 2 uses the user-repo affinity graph to find repo-to-repo similarities: for any given repo (say, linkedin/datafu), find the repos most similar to it. We do this by saying that for any user U, if they have an affinity with both repo A and repo B, then there is a link between A and B. We use these links and a formula based on Bayes’ theorem to estimate the probability that any random user will interact with repo B given that they interacted with repo A. That probability is our similarity metric.
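As a rough illustration of Step 2, the Python sketch below estimates P(B | A) as the fraction of users linked to repo A who are also linked to repo B. This is a simplified co-occurrence estimate, not the exact formula used in the project: it ignores the affinity weights, and the function name `repo_similarities` is made up for this example.

```python
from collections import defaultdict
from itertools import permutations


def repo_similarities(affinities):
    """Map {(user, repo): affinity} to {(repo_a, repo_b): estimate of P(b | a)}."""
    repos_by_user = defaultdict(set)
    for user, repo in affinities:
        repos_by_user[user].add(repo)

    users_per_repo = defaultdict(int)  # number of users linked to each repo
    co_users = defaultdict(int)        # number of users linked to both repos
    for repos in repos_by_user.values():
        for repo in repos:
            users_per_repo[repo] += 1
        for a, b in permutations(repos, 2):
            co_users[(a, b)] += 1

    # P(b | a) is estimated as users(a and b) / users(a); affinity weights are ignored.
    return {(a, b): n / users_per_repo[a] for (a, b), n in co_users.items()}


affinities = {
    ("u1", "A"): 0.5, ("u1", "B"): 0.5,
    ("u2", "A"): 0.9,
    ("u3", "B"): 0.2, ("u3", "C"): 0.2,
}
sims = repo_similarities(affinities)
```

Note that the result is asymmetric: in the toy data, everyone linked to C is also linked to B, but only half of the users linked to B are linked to C.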
Step 3 uses the repo-to-repo similarities to find “what repos are most similar to the repos a user has interacted with”. The main challenge here was ranking all the candidate recommendations: we take into account both the affinity of the user to the “reason” repo (the one they interacted with in the first place that would cause the candidate recommendation to be made) and the overall popularity of the candidate recommendation.
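One way to picture the ranking in Step 3 is the Python sketch below. It scores each candidate as (user’s affinity to the reason repo) × (similarity of reason to candidate) × (a popularity factor), and keeps the best-scoring reason per candidate. The multiplicative form, the `popularity` factor, and the name `recommend` are all assumptions for illustration; this post does not specify how the real ranking combines these signals or whether popularity raises or lowers a score.

```python
def recommend(user, affinities, similarities, popularity, top_n=5):
    """Return up to top_n (candidate, reason, score) tuples for one user.

    affinities:   {(user, repo): affinity in (0, 1)}
    similarities: {(reason_repo, candidate_repo): similarity}
    popularity:   {repo: hypothetical multiplicative factor}, defaults to 1.0
    """
    owned = {repo: aff for (u, repo), aff in affinities.items() if u == user}
    best = {}
    for (reason, candidate), sim in similarities.items():
        if reason not in owned or candidate in owned:
            continue  # only recommend new repos, starting from repos the user knows
        score = owned[reason] * sim * popularity.get(candidate, 1.0)
        if candidate not in best or score > best[candidate][0]:
            best[candidate] = (score, reason)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(candidate, reason, score) for candidate, (score, reason) in ranked[:top_n]]


affinities = {("alice", "A"): 0.8, ("alice", "B"): 0.5}
similarities = {("A", "C"): 0.5, ("B", "C"): 0.9, ("A", "D"): 0.25}
recommendations = recommend("alice", affinities, similarities, popularity={"D": 2.0})
```

Keeping the reason repo alongside each candidate makes it possible to tell the user why a repo was recommended.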
Check out the GitRec site or the Chrome Extension if you’d like to see more. In addition, if you’re interested in the details of the algorithm, or just want to see production-ready Pig code, we’ve open-sourced the entire project. The recommender code is here. If you have improvements, feel free to send me a pull request.
Like what you see? We’re also selecting a few companies to get free, custom-built recommenders.