Cosine similarity between a combination of numerical and text values

I'm trying to build a simple content-based filtering model on the Yelp dataset, using the data about restaurants.
I have a DataFrame in this format:

>>> business_df.dtypes
address          object
attributes       object
business_id      object
categories       object
city             object
hours            object
is_open          object
latitude        float64
longitude       float64
name             object
postal_code      object
review_count      int64
stars           float64
state            object

Now I'm trying to build a content-based filtering model that answers the question "Given a restaurant, recommend similar restaurants".

I'm trying to implement the model described under "Content-Based Recommender" here: https://www.datacamp.com/community/tutorials/recommender-systems-python

Basically, they use some text fields to build a CountVectorizer matrix and then compute the cosine similarity between its rows to get the similarity between movies.

They say later that

Introduce a popularity filter: this recommender would take the 30 most similar movies, calculate the weighted ratings (using the IMDB formula from above), sort movies based on this rating, and return the top 10 movies.

I'm trying to use Categories, Attributes, Latitude and Longitude (for distance), and Stars and Review Count (stars weighted by review count, so that a higher number of reviews gives the stars more weight) to build a similar model.

But I don't know how to incorporate the numerical columns into the model here. I'm certain I cannot pass the numerical columns into the Count Vectorizer.

Can I build two models, one on the text fields and the other by simply calculating the cosine similarity (or Pearson correlation) between the numerical columns, and then combine the two? If yes, how would I do that?

Or could I follow the DataCamp model: handle the text fields in one model, then use the formula to incorporate ratings? If yes, I would still be unable to account for distance based on latitude and longitude.

Let us assume that the CountVectorizer gives you a matrix C of shape (N, m), where N = number of restaurants and m = number of features (here, the word counts).

Now, since you want to add numerical features, say you have k such features. You can simply compute these features for each restaurant and concatenate them to the matrix C. Each restaurant then has (m+k) features, and the shape of C becomes (N, m+k). You can concatenate with pandas or, since CountVectorizer returns a sparse matrix, with scipy.sparse.hstack.

Now you can simply compute the cosine similarity on this matrix; that way you take into account both the text features and the numeric features.
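
The steps above can be sketched roughly as follows. This is only an illustration under assumptions, not a definitive implementation: the toy category strings and numeric values are made up, stars and review_count stand in for the k numeric features, and scipy.sparse.hstack is used for the concatenation because CountVectorizer returns a sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data standing in for business_df (values are made up for illustration).
categories = [
    "Pizza, Italian, Restaurants",
    "Italian, Pasta, Restaurants",
    "Sushi, Japanese, Restaurants",
]

# k = 2 numeric features per restaurant: stars and review_count (assumed choice).
numeric = np.array([
    [4.5, 120],
    [4.0, 80],
    [3.5, 300],
], dtype=float)

# C has shape (N, m): one row per restaurant, one column per token.
C = CountVectorizer().fit_transform(categories)

# Append the k numeric columns, giving a matrix of shape (N, m + k).
combined = hstack([C, csr_matrix(numeric)]).tocsr()

# Pairwise cosine similarity between all rows, shape (N, N).
similarity = cosine_similarity(combined)
```

Row i of similarity then holds the similarity of restaurant i to every other restaurant, so sorting that row in descending order (and skipping i itself) gives the candidates to recommend.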

However, I would strongly suggest normalizing these values: numeric features with large magnitudes (review counts, for example) can dominate the cosine similarity and lead to poor results. Also, instead of the CountVectorizer, a TF-IDF matrix or even word embeddings might give you better results.
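
A minimal sketch of why the scaling matters, again on made-up toy data. MinMaxScaler is just one possible way to normalize (an assumption here, not something the answer prescribes), and TfidfVectorizer is swapped in for CountVectorizer as suggested above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for business_df (values are made up for illustration).
categories = [
    "Pizza, Italian, Restaurants",
    "Italian, Pasta, Restaurants",
    "Sushi, Japanese, Restaurants",
]
# Columns: stars, review_count.
numeric = np.array([
    [4.5, 120],
    [4.0, 80],
    [3.5, 300],
], dtype=float)

# TF-IDF text features; each row is L2-normalized by default.
text = TfidfVectorizer().fit_transform(categories)

# Without scaling, review_count (in the hundreds) swamps the text features,
# so every pair of restaurants ends up looking almost identical.
raw = hstack([text, csr_matrix(numeric)]).tocsr()
raw_similarity = cosine_similarity(raw)

# Rescale each numeric column to [0, 1] so it is comparable to the text part.
scaled_numeric = MinMaxScaler().fit_transform(numeric)
scaled = hstack([text, csr_matrix(scaled_numeric)]).tocsr()
scaled_similarity = cosine_similarity(scaled)
```

On this toy data the unscaled similarities are all very close to 1, whereas after scaling the two Italian restaurants come out more similar to each other than either is to the sushi restaurant. How much weight the numeric block should get relative to the text block remains a modeling choice; multiplying scaled_numeric by a constant before stacking is one simple way to tune it.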

Have you solved the problem? I'm in the same situation: I built a matrix of hundreds of job titles and features (job requirements), and I'm trying to find similar jobs in the data frame. I haven't found any method to calculate similarity between numerical values, since all values in the data frame are float64. Help! Thank you!
