Lately, I've been reading a lot about recommender systems. There is considerable interest in both academia and industry in improving recommendation engines. The widespread use of mobile devices and social networking sites paves the way for recommender systems to become hugely important. People have become more open about sharing their interests, tastes, likes, and dislikes publicly, and a tremendous amount of personal information about users can be collected from multiple sources such as social networking sites, photo sharing applications, etc. The effective use of this data, combined with state-of-the-art machine learning techniques, makes personalized recommendations for places, people, things, music, events, and many other areas possible. I believe there is still plenty of room for improvement, and standard challenges like the cold-start problem remain. However, there have been very encouraging developments which signal that there is more to come.
I intend to keep writing on this topic, but I'll start by analyzing and summarizing a paper from WSDM 2011 titled "Recommender Systems with Social Regularization". The reason I chose it is twofold. First, it is a relatively recent paper from a well-respected conference. Second, it covers important topics like matrix factorization for recommender systems, embedded with a social aspect. Hence, it'll be a good learning exercise to go through the proposed algorithm of this paper in depth. I highly recommend reading the paper for details that I skip here.
The authors' main model relies on low-rank matrix factorization. Assume you have an $m\times n$ user-item rating matrix $R$ where each row represents the ratings of a corresponding user for $n$ items. Matrix factorization techniques try to approximate this rating matrix by lower-rank factor matrices: $R \approx U^TV$ where $U\in\mathbb{R}^{l \times m}$ and $V\in\mathbb{R}^{l \times n}$. You can think of $U$ and $V$ as factor matrices where each column is a vector in an $l$-dimensional space, and $l$ is smaller than both $m$ and $n$. In other words, you project the user and item rating vectors into a lower-dimensional space where each user is represented in terms of their interest in, and emphasis on, the corresponding factors, and each item in terms of how much it possesses those characteristics. Ideally, one would use Singular Value Decomposition (SVD) to find the estimates for $U$ and $V$. However, rating matrices in almost all recommendation problems are extremely sparse, and traditional SVD is undefined for incomplete matrices. The workaround is to consider only the observed ratings. As with any problem where you optimize over high-dimensional data with insufficient samples, the danger of overfitting arises. We're lucky enough that regularization comes to the rescue. The following general objective function allows us to avoid overfitting by regularizing the factor matrices:
$$\min_{U,V} \frac{1}{2}\sum^m_{i=1}\sum^n_{j=1}I_{ij}(R_{ij} - U^T_iV_j)^2+\frac{\lambda_1}{2}||U||^2_F + \frac{\lambda_2}{2}||V||^2_F$$
Some simple yet useful notes here:
1. $\lambda_1$ and $\lambda_2$ are regularization coefficients which are generally tuned via cross-validation.
2. The division by $2$ is purely for mathematical convenience; it makes the derivatives cleaner.
3. $I_{ij}$ is an indicator function that is $1$ if user $i$ rated item $j$, and $0$ otherwise.
4. $F$ indicates the Frobenius norm (the matrix analogue of the Euclidean norm).
This optimization minimizes the sum of squared errors on the observed ratings plus quadratic regularization terms. It can easily be solved via gradient-descent-based techniques, which update the current parameters by taking a step in the direction of the negative gradient.
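To make the mechanics concrete, here is a minimal sketch of these gradient updates in Python. This is my own illustration, not the authors' code: it assumes a dense rating matrix `R` with a binary indicator matrix `I` standing in for $I_{ij}$, and the gradients follow directly from differentiating the objective above.

```python
import numpy as np

def factorize(R, I, l=10, lam1=0.1, lam2=0.1, lr=0.01, epochs=500):
    """Low-rank factorization of R (m x n) with quadratic regularization.

    I[i, j] = 1 iff user i rated item j, 0 otherwise.
    Returns U (l x m) and V (l x n) such that R is approximated by U.T @ V.
    """
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((l, m))   # one column per user
    V = 0.1 * rng.standard_normal((l, n))   # one column per item

    for _ in range(epochs):
        E = I * (U.T @ V - R)           # errors on observed entries only
        grad_U = V @ E.T + lam1 * U     # gradient of the objective w.r.t. U
        grad_V = U @ E + lam2 * V       # gradient of the objective w.r.t. V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V
```

In practice you would store the ratings sparsely and loop over observed entries (or use stochastic updates) rather than forming dense $m \times n$ matrices, but the dense version keeps the correspondence with the objective obvious.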
The authors add a social constraint to the above objective so that a user's tastes (preferences) stay similar to those of her friends; the closer the friendship, the stronger the influence. They formulate this claim with an additional regularization term that biases each user's factor vector toward the similarity-weighted average of her friends':
$$\min_{U,V} \frac{1}{2}\sum^m_{i=1}\sum^n_{j=1}I_{ij}(R_{ij} - U^T_iV_j)^2+\frac{\lambda_1}{2}||U||^2_F + \frac{\lambda_2}{2}||V||^2_F $$
$$+\frac{\alpha}{2}\sum^m_{i=1}|| U_i - \frac{\sum_{f\in\mathbf{F_i}}Sim(i,f)\times U_f }{\sum_{f\in\mathbf{F_i}}Sim(i,f)} ||^2_F$$
Note that I'm using a slightly different notation than that of the paper. The authors compute the proximity-weighted average only over what they call "outlink" friends, e.g., the people you follow on Twitter, as opposed to the people who follow you (the "inlink" friends in the authors' definition). Initially, it was unclear to me why they had to make this distinction. However, the derivative calculations later made it clear: because user $i$ also appears inside the friend averages of the users who follow her, the gradient of this term with respect to $U_i$ picks up an extra coupling contribution, and the outlink-only definition pins down exactly which links generate it. I skip the derivative calculations here as they are pretty straightforward algebra; please check the paper if you're interested. I still think it is more intuitive to consider a user's entire circle without distinguishing between outlink and inlink friends. Inlink and outlink friends may capture different kinds of shared tastes and preferences, so disregarding one group seems like throwing away potentially valuable information.
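To make the term concrete, here is a small sketch of the average-based penalty and the direct part of its gradient. It is my own illustration with hypothetical `friends` and `sim` inputs; the comment marks where the coupling term from the full derivative would enter.

```python
import numpy as np

def social_reg_average(U, friends, sim, alpha=0.1):
    """Average-based social regularization.

    U       : l x m user factor matrix (one column per user)
    friends : dict mapping user i to a list of her outlink friends F_i
    sim     : function sim(i, f) returning a nonnegative similarity weight
    Returns the penalty and the direct per-user gradient contribution.
    """
    penalty = 0.0
    grad = np.zeros_like(U)
    for i, F in friends.items():
        if not F:
            continue
        w = np.array([sim(i, f) for f in F])
        avg = (U[:, F] @ w) / w.sum()   # similarity-weighted friend average
        diff = U[:, i] - avg
        penalty += np.sum(diff ** 2)
        grad[:, i] += alpha * diff
        # The full derivative also includes a term for every user g with
        # i in F_g, since U_i appears inside g's friend average -- this is
        # exactly where the outlink/inlink distinction starts to matter.
    return 0.5 * alpha * penalty, grad
```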
Another objection is that the above framework is likely to be most effective in small communities where friends share very similar tastes and interests. The authors try to address this concern with a second type of social regularization that minimizes the distance between a user's preferences and those of each of her friends individually, as opposed to the distance to the average of her friends' tastes:
$$\min_{U,V} \frac{1}{2}\sum^m_{i=1}\sum^n_{j=1}I_{ij}(R_{ij} - U^T_iV_j)^2+\frac{\lambda_1}{2}||U||^2_F + \frac{\lambda_2}{2}||V||^2_F $$
$$+\frac{\beta}{2}\sum^m_{i=1} {\sum_{f\in\mathbf{F_i}}Sim(i,f)}|| U_i - U_f ||^2_F$$
The authors claim one advantage of this approach is that it indirectly models the propagation of preferences. For instance, it would shrink the distance between $U_i$ and $U_g$ even if $i$ and $g$ are not friends, as long as they share a friend $f$, since the objective minimizes both $||U_i - U_f||^2_F$ and $||U_f - U_g||^2_F$. I'm not sure this claim holds. To me, it looks like one could only achieve minimization of an upper bound on the distance $||U_i - U_g||_F$ via the triangle inequality:
$$ ||U_i - U_g||_F \leq ||U_i - U_f||_F + ||U_f - U_g||_F $$
This is my interpretation but I'd love to hear your comments and thoughts on this.
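For comparison, here is the same kind of sketch for the individual-based term, using the same hypothetical `friends` and `sim` inputs. Notice that each similarity-weighted difference pulls on both endpoints, which is the symmetric second part of the derivative.

```python
import numpy as np

def social_reg_individual(U, friends, sim, beta=0.1):
    """Individual-based social regularization: pull each user toward each
    of her friends separately, weighted by their similarity."""
    penalty = 0.0
    grad = np.zeros_like(U)
    for i, F in friends.items():
        for f in F:
            s = sim(i, f)
            diff = U[:, i] - U[:, f]
            penalty += s * np.sum(diff ** 2)
            grad[:, i] += beta * s * diff   # pulls U_i toward U_f
            grad[:, f] -= beta * s * diff   # and U_f toward U_i
    return 0.5 * beta * penalty, grad
```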
The authors propose vector space similarity or the Pearson correlation coefficient to compute the similarity between users. The Pearson correlation coefficient is more effective, especially because it assigns high similarity to users who have similar tastes even when one tends to give consistently higher ratings than the other; Euclidean- or cosine-based similarity functions would report such pairs as less similar than they really are. The authors do not address the scalability of these pairwise similarity calculations on large datasets, partly because they only compute pairwise similarities within friend circles instead of over the entire social graph. For problems that require more extensive computations of this kind, there are near-neighbor detection algorithms such as MinHash, locality-sensitive hashing, etc. These algorithms exploit the fact that in any neighborhood-based approach, one really only cares about the close neighborhood rather than the distance to every other data point in the input space. They provide a way to compute these near-neighbors efficiently, so any pairwise calculations can then be restricted to those neighborhoods. I'll talk about such algorithms in a future post.
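As a quick toy illustration of why mean-centering matters (my own example, not from the paper), here is the Pearson correlation computed over two users' co-rated items. One user rates everything about two stars higher than the other, yet the similarity is still maximal:

```python
import numpy as np

def pearson_sim(ri, rj):
    """Pearson correlation between two users over their co-rated items.
    ri, rj: rating vectors with np.nan marking unrated items."""
    co = ~np.isnan(ri) & ~np.isnan(rj)          # co-rated items only
    if co.sum() < 2:
        return 0.0
    a = ri[co] - ri[co].mean()                  # mean-centering removes
    b = rj[co] - rj[co].mean()                  # any constant rating offset
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

u = np.array([1.0, 2.0, 3.0, np.nan, 4.0])
v = np.array([3.0, 4.0, 5.0, 2.0, np.nan])
print(pearson_sim(u, v))   # ~1.0: identical taste despite the offset
```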
Friday, September 28, 2012
Welcome Message
Welcome to my blog!
I'm quite new to blogging, but it has started to fascinate me recently as I've been reading others' blogs. I have found a ton of useful information on machine learning, data mining, and, more recently, data science while reading blogs. I have also realized that I tend to forget the simple yet useful little things I learn while doing research. Writing this blog will help me organize my thoughts and learnings, and share that knowledge with others. So, this is my humble way of giving back to this amazing community.