
How to build a content-based recommender system that uses multiple attributes?

I want to build a content-based recommender system in Python that uses multiple attributes to decide whether two items are similar. In my case, the "items" are packages hosted by the C# package manager (example) that have various attributes, such as name, description, and tags, that could help identify similar packages.

I have a prototype recommender system here that currently uses only a single attribute, the description, to decide whether packages are similar. It computes TF-IDF rankings for the descriptions and prints out the top 10 recommendations based on that:

# Code mostly stolen from http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def train(dataframe):
    tfidf = TfidfVectorizer(analyzer='word',
                            ngram_range=(1, 3),
                            min_df=1,  # keeps every term; min_df=0 is rejected by newer scikit-learn
                            stop_words='english')
    tfidf_matrix = tfidf.fit_transform(dataframe['description'])
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
    for idx, row in dataframe.iterrows():
        # Take the 11 best matches; the item itself is removed below,
        # leaving the top 10 recommendations.
        similar_indices = cosine_similarities[idx].argsort()[:-12:-1]
        similar_items = [(dataframe['id'][i], cosine_similarities[idx][i])
                         for i in similar_indices]

        id = row['id']
        similar_items = [it for it in similar_items if it[0] != id]
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items, ())
        print("Top 10 recommendations for %s: %s" % (id, flattened))

How can I combine cosine_similarities with other similarity measures (based on same author, similar names, shared tags, etc.) to give more context to my recommendations?

For some context, my work with content-based recommenders has revolved primarily around raw text and categorical data/features. Here's a high-level approach I've taken that has worked out nicely and is pretty simple to implement.

Suppose I have three feature columns that I can potentially use to make recommendations: description, name, and tags. To me, the path of least resistance entails combining these three feature sets in a useful way.

You're off to a good start, using TF-IDF to encode description. So why not treat name and tags in a similar way by creating a feature "corpus" consisting of description, name, and tags? Literally, this would mean concatenating the contents of each of the three columns into one long text column.

Be wise about the concatenation, though, as it's probably to your advantage to preserve which column a given word comes from, in the case of features like name and tags, which are assumed to have much lower cardinality than description. To put it more explicitly: instead of just creating your corpus column like this:

import pandas as pd

df['corpus'] = (pd.Series(df[['description', 'name', 'tags']]
                .fillna('')
                .values.tolist()
                ).str.join(' '))

You might try preserving information about where particular data points in name and tags come from. Something like this:

df['name_feature'] = ['name_{}'.format(x) for x in df['name']]
df['tags_feature'] = ['tags_{}'.format(x) for x in df['tags']]

And after you do that, I would take things a step further by considering how the default tokenizer (which you're using above) works in TfidfVectorizer. Suppose you have the name of a given package's author: "Johnny 'Lightning' Thundersmith". If you just concatenate that literal string, the tokenizer will split it up and roll each of "Johnny", "Lightning", and "Thundersmith" into separate features, which could diminish the information added by that row's value for name. I think it's best to preserve that information, so I would do something like this to each of your lower-cardinality text columns (e.g. name or tags):

import string

def raw_text_to_feature(s, sep=' ', join_sep='x', to_include=string.ascii_lowercase):
    def filter_word(word):
        # Lowercase and keep only ASCII letters, dropping punctuation and digits.
        return ''.join([c for c in word.lower() if c in to_include])
    return join_sep.join([filter_word(word) for word in s.split(sep)])

df['name_feature'] = df['name'].apply(raw_text_to_feature)
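To illustrate, the author example from above now collapses into a single token that the default tokenizer won't split apart:

# "Johnny 'Lightning' Thundersmith" -> "johnnyxlightningxthundersmith"
print(raw_text_to_feature("Johnny 'Lightning' Thundersmith"))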

The same sort of critical thinking should be applied to tags. If you've got a comma-separated "list" of tags, you'll probably have to parse those individually and figure out the right way to use them.
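For instance, if each row's tags arrive as one comma-separated string, a minimal sketch (the comma separator and the tags column are assumptions about your data) could split them and prefix each tag the same way as above:

def tags_to_feature(tag_string, sep=','):
    # Normalize each tag with raw_text_to_feature and prefix it so that
    # tag tokens stay distinct from ordinary description words.
    if not isinstance(tag_string, str):
        return ''
    tags = [raw_text_to_feature(t.strip()) for t in tag_string.split(sep)]
    return ' '.join('tags_{}'.format(t) for t in tags if t)

df['tags_feature'] = df['tags'].apply(tags_to_feature)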

Ultimately, once you've got all of your <x>_feature columns created, you can create your final "corpus" and plug that into your recommender system as input.
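As a sketch of that final step (column names taken from the examples above), assembling the corpus and pointing the same TF-IDF pipeline at it might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer

# Concatenate the engineered feature columns into one text column.
feature_cols = ['description', 'name_feature', 'tags_feature']
df['corpus'] = df[feature_cols].fillna('').agg(' '.join, axis=1)

# Same vectorizer setup as in the question, now fit on the combined corpus.
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['corpus'])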

This whole system takes some engineering, to be sure, but I've found it's the easiest way to introduce new information from other columns that have different cardinalities.

As I understand your question, there are two ways this can be done:

  1. Combine the other features with tfidf_matrix and then calculate the cosine similarity

  2. Calculate the similarity of other features using other methods and then somehow combine them with the cosine similarity of tfidf_matrix to get a meaningful metric.

I was talking about the first one.

For example, let's say that for your data the tfidf_matrix (for only the 'description' column) has shape (3000, 4000), where 3000 is the number of rows in the data and 4000 is the number of unique words (the vocabulary) found by the TfidfVectorizer.

Now let's say you do some feature processing on the other columns ('authors', 'id', etc.) and that produces 5 columns, so the shape of that data is (3000, 5).

I was saying to combine the two matrices (concatenate their columns) so that the new shape of your data is (3000, 4005), and then calculate the cosine similarity.

See below example:

from scipy import sparse
from sklearn.metrics.pairwise import linear_kernel

# This is your original matrix
tfidf_matrix = tfidf.fit_transform(dataframe['description'])

# This is the matrix of other features (your own processing goes here)
other_matrix = some_processing_on_other_columns()

# Stack the columns side by side: (3000, 4000) + (3000, 5) -> (3000, 4005)
combined_matrix = sparse.hstack((tfidf_matrix, other_matrix))

cosine_similarities = linear_kernel(combined_matrix, combined_matrix)
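Here some_processing_on_other_columns() is just a placeholder. As one hypothetical way to fill it in, you could one-hot encode a low-cardinality column such as 'authors'; and because cosine similarity is sensitive to how many columns each block contributes, you may want to weight that block relative to the TF-IDF columns (the 0.5 below is an arbitrary illustration):

from sklearn.preprocessing import OneHotEncoder

# Hypothetical processing: one-hot encode 'authors' into a sparse matrix of
# shape (n_rows, n_unique_authors) that can be hstacked with tfidf_matrix.
encoder = OneHotEncoder(handle_unknown='ignore')
other_matrix = encoder.fit_transform(dataframe[['authors']])

# Optional: scale the block to control its weight relative to the
# thousands of TF-IDF columns (0.5 is just an illustrative choice).
other_matrix = 0.5 * other_matrix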

You have a vector for a user $\gamma_u$ and an item $\gamma_i$. The scoring function for your recommendation is:

$f(u, i) = \alpha + \beta_u + \beta_i + \gamma_u^T \gamma_i$

Right now you said your feature vector has only one item, but once you add more, this model will scale to accommodate them.

In this case you already engineered your vectors, but typically in recommenders the features are learned through matrix factorization. That is called a latent factor model, whereas what you have is a hand-crafted model.
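As a minimal numpy sketch of that scoring function (the numbers below are made-up stand-ins, not learned parameters; $\alpha$ is a global offset and $\beta_u$, $\beta_i$ are the user and item biases):

import numpy as np

def score(alpha, beta_u, beta_i, gamma_u, gamma_i):
    # f(u, i) = alpha + beta_u + beta_i + gamma_u . gamma_i
    return alpha + beta_u + beta_i + np.dot(gamma_u, gamma_i)

alpha = 0.1                     # global offset
beta_u, beta_i = 0.2, -0.05     # user and item biases
gamma_u = np.array([0.3, 0.1])  # user factor vector
gamma_i = np.array([0.4, 0.2])  # item factor vector
print(score(alpha, beta_u, beta_i, gamma_u, gamma_i))  # ~0.39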
