Cosine similarity between item descriptions using scikit-learn

I am using Python 2.7 and scikit-learn to find the cosine similarity between item descriptions.

I have a DataFrame df, for example:

items    description

1fgg     abcd ty
2hhj     abc r 
3jkl     r df

I followed these steps:

1) tokenizing and stemming each description

2) transforming the corpus into a vector space using tf-idf

3) calculating the cosine distance between each pair of descriptions as a measure of similarity: distance = 1 - cosine_similarity(tfidf_matrix)

My goal is to have a similarity matrix of items like the one below, so that I can answer questions such as "What is the similarity between the items 1fgg and 2hhj?":

        1fgg    2hhj    3jkl
1fgg    1.0     0.8     0.1
2hhj    0.8     1.0     0.0
3jkl    0.1     0.0     1.0

How can I get this result? Thank you for your time.

You can use a numpy array to create the matrix and then add an index and a header to turn it into a dataframe.

Assume you have a list of descriptions, descriptions = ['abc', 'bcd', 'etc' ...], and the corresponding tf-idf matrix, where each row number corresponds to a description number.

First create a zero-filled numpy array of shape NxN, where N = len(descriptions):

import numpy as np
distance_matrix = np.zeros((N, N))  # N x N placeholder, filled in below

Then you need to fill it with the actual distances:

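# cosine_distance is a placeholder for a function that returns
# 1 - cosine similarity of two row vectors.
for i in range(N):
    for j in range(N):
        distance_matrix[i, j] = cosine_distance(tf_idf[i, :], tf_idf[j, :])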
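
The loop above calls a cosine_distance function that this answer does not spell out. One possible helper is sketched here, assuming each row is a dense 1-D numpy vector; the helper body and the toy tf_idf values are illustrative assumptions, not part of the original answer:

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0.0:
        # Convention chosen here: an all-zero vector is maximally distant.
        return 1.0
    return 1.0 - float(np.dot(u, v) / norm)

# Toy dense stand-in for a tf-idf matrix (one row per description).
tf_idf = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

N = tf_idf.shape[0]
distance_matrix = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        distance_matrix[i, j] = cosine_distance(tf_idf[i, :], tf_idf[j, :])
```

If tf_idf is a scipy sparse matrix, which is what scikit-learn's TfidfVectorizer returns, each row would need to be converted with .toarray() before being passed to a helper like this one.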

You can then create the dataframe with:

pandas.DataFrame(distance_matrix, index = items_list, columns = items_list)
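
For comparison, scikit-learn can compute all pairwise similarities in one call, which avoids the double loop. What follows is only a sketch and not the method described in this answer: it skips the stemming step, reuses the sample data from the question, and widens token_pattern (an assumption made here) so that one-letter tokens such as r are kept.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items_list = ["1fgg", "2hhj", "3jkl"]
descriptions = ["abcd ty", "abc r", "r df"]

# The default token_pattern drops one-character tokens; keep them here.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tf_idf = vectorizer.fit_transform(descriptions)  # sparse, one row per item

# Pairwise cosine similarity between all rows, as a dense N x N array.
similarity = cosine_similarity(tf_idf)

# Label rows and columns with the item ids.
similarity_df = pd.DataFrame(similarity, index=items_list, columns=items_list)

# Look up the similarity between two items by label.
score = similarity_df.loc["2hhj", "3jkl"]
```

Because the question asks for similarities rather than distances, the similarity matrix is used directly here; the corresponding distance matrix would be 1 - similarity.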
