How to build a recommendation system using tf-idf and cosine similarity?

I have been trying to build a beer recommendation engine, and I have decided to keep it simple by using tf-idf and cosine similarity.

Here is my code so far:

import pandas as pd     
import re
import numpy as np 
from bs4 import BeautifulSoup 
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
wnlzer = WordNetLemmatizer()


train = pd.read_csv("labeledTrainData.tsv" , header = 0 ,  \
    delimiter = '\t' , quoting  = 3)


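# Build the stop-word set once instead of on every token.
stop_words = set(stopwords.words('english'))

def raw_string_to_list_clean_string(raw_train_review):
    # Strip HTML markup, keeping only the visible text.
    remove_html = BeautifulSoup(raw_train_review, "html.parser").text
    # Drop everything except letters and spaces.
    remove_punch = re.sub('[^A-Za-z ]', "", remove_html)
    token = remove_punch.lower().split()
    # Remove stop words, then lemmatize what is left.
    srm_token = [wnlzer.lemmatize(i) for i in token if i not in stop_words]
    clean_text = " ".join(srm_token)
    return clean_text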

ready_train_list = []
length  = len(train['review'])
for i in range(0 , length):
    if (i%100 == 0):
        print("doing %d of %d of training data set" % (i + 1, length))
    a = raw_string_to_list_clean_string(train['review'][i])
    ready_train_list.append(a)

vectorizer = TfidfVectorizer(analyzer="word", tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=20000)
training_our_vectorizer = vectorizer.fit_transform(ready_train_list)

Now I know how to use cosine similarity, but I am not able to figure out:

  1. how to make use of cosine similarity here
  2. how to restrict the recommendations to a maximum of 5 beers

A simple implementation would be to compute the cosine distance to each of the other beers using cdist, and then return your recommendations using argsort:

from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vec = TfidfVectorizer()
beerlist = np.array(['heinekin lager', 'corona lager', 'heinekin ale', 'budweiser lager'])
beerlist_tfidf = vec.fit_transform(beerlist).toarray()
beer_tfidf = vec.transform(['heinekin lager']).toarray()
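# Sort all beers by cosine distance to the query, closest first.
rec_idx = cdist(beer_tfidf, beerlist_tfidf, 'cosine').argsort()
# Skip position 0, which is the query beer itself at distance 0.
print(beerlist[rec_idx[0][1:]])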

#['heinekin ale' 'corona lager' 'budweiser lager']
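
To cap the result at 5 beers, you can slice the sorted indices. Below is a minimal sketch of that idea, not a definitive implementation. The function name `recommend` and its `top_n` parameter are hypothetical names introduced for illustration. It also swaps `cdist` for scikit-learn's `cosine_similarity`, which accepts the sparse matrix directly, so the `.toarray()` calls are not needed. That matters once you have 20000 features.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(query, names, vectorizer, tfidf_matrix, top_n=5):
    """Return up to top_n names most similar to query, best match first.

    Hypothetical helper for illustration; not part of any library.
    """
    query_vec = vectorizer.transform([query])
    # Cosine similarity of the query against every row (sparse input is fine).
    sims = cosine_similarity(query_vec, tfidf_matrix).ravel()
    # Highest similarity first; a stable sort keeps ties in list order.
    order = np.argsort(-sims, kind="stable")
    # Drop exact-name matches so a beer is not recommended to itself.
    order = [i for i in order if names[i] != query]
    # The slice caps the number of recommendations at top_n.
    return [names[i] for i in order[:top_n]]


beers = ['heinekin lager', 'corona lager', 'heinekin ale', 'budweiser lager']
vec = TfidfVectorizer()
beer_tfidf = vec.fit_transform(beers)  # stays sparse

print(recommend('heinekin lager', beers, vec, beer_tfidf, top_n=5))
print(recommend('heinekin lager', beers, vec, beer_tfidf, top_n=2))
```

With only four beers in the list, `top_n=5` returns the three remaining beers, and `top_n=2` returns just the two closest. On your data you would pass your own fitted `vectorizer` and `training_our_vectorizer` instead of the toy list.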

