Cosine similarity between item descriptions using scikit-learn

Question

I am using Python 2.7 and scikit-learn to find the cosine similarity between item descriptions.

I have a df, for example:

items    description

1fgg     abcd ty
2hhj     abc r 
3jkl     r df

I followed these steps:

1) tokenized and stemmed each description

2) transformed the corpus into vector space using tf-idf

3) calculated the cosine distance between each pair of description texts as a measure of similarity: distance = 1 - cosine_similarity(tfidf_matrix)
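
Steps 2 and 3 can be sketched as follows. This is a minimal sketch, assuming the descriptions from the example table are held in a plain list; stemming (step 1) is omitted, and the `token_pattern` override is an assumption made here so that single-character tokens such as `r` are not dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Descriptions from the example table (stemming is skipped in this sketch).
descriptions = ["abcd ty", "abc r", "r df"]

# Step 2: map the corpus into tf-idf vector space.
# The default token_pattern ignores one-character tokens, so it is widened here.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf_matrix = vectorizer.fit_transform(descriptions)

# Step 3: pairwise cosine similarity, then distance = 1 - similarity.
similarity = cosine_similarity(tfidf_matrix)
distance = 1 - similarity
```

With these three descriptions only the second and third share a token (`r`), so they are the only distinct pair with a non-zero similarity.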

My goal is to have a similarity matrix of items like the one below, so that I can answer questions such as "What is the similarity between the items 1fgg and 2hhj?":

        1fgg    2hhj    3jkl
1fgg    1.0     0.8     0.1
2hhj    0.8     1.0     0.0
3jkl    0.1     0.0     1.0

How can I get this result? Thank you for your time.

Answer 1

You can use a numpy array to create the matrix and then add an index and header to create a dataframe.

Assume you have a list of descriptions, descriptions = ['abc', 'bcd', 'etc' ...], and the corresponding tf-idf matrix (each row number corresponds to a description number).

First create a zero-filled numpy array of shape NxN, where N = len(descriptions):

import numpy as np
distance_matrix = np.zeros((N, N))

Then fill it with the actual distances:

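from scipy.spatial.distance import cosine as cosine_distance  # one possible choice: returns 1 - cosine similarity

rows = tf_idf.toarray()  # assumes tf_idf is a sparse matrix, as returned by TfidfVectorizer
for i in range(N):
    for j in range(N):
        distance_matrix[i, j] = cosine_distance(rows[i], rows[j])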
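
The double loop makes N² separate calls. As an alternative sketch (not part of the original answer), the same matrix can be computed in one step by scaling each row to unit length and taking a matrix product. The small dense `tf_idf` array below is made-up illustration data; for a sparse matrix produced by scikit-learn, `sklearn.metrics.pairwise.cosine_distances(tf_idf)` should serve the same purpose.

```python
import numpy as np

# Hypothetical dense tf-idf matrix: one row per description.
tf_idf = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Scale every row to unit length; this assumes no row is all zeros.
norms = np.linalg.norm(tf_idf, axis=1, keepdims=True)
unit_rows = tf_idf / norms

# Dot products of unit vectors are cosine similarities.
similarity_matrix = unit_rows.dot(unit_rows.T)
distance_matrix = 1.0 - similarity_matrix
```

Rows 0 and 1 point in the same direction, so their distance is 0, while row 2 is orthogonal to both and sits at distance 1.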

You can then create the dataframe with:

pandas.DataFrame(distance_matrix, index = items_list, columns = items_list)
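
Since the question asks for similarities rather than distances, the values can be converted back with 1 - distance and looked up by item label. A small sketch, where the distances are placeholder numbers chosen to mirror the example matrix in the question, not computed results:

```python
import numpy as np
import pandas as pd

items_list = ["1fgg", "2hhj", "3jkl"]

# Placeholder distances, chosen to match the example matrix in the question.
distance_matrix = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 1.0],
    [0.9, 1.0, 0.0],
])

# Similarity is 1 - distance; rows and columns are labeled by item.
similarity_df = pd.DataFrame(1 - distance_matrix, index=items_list, columns=items_list)

# Similarity between items 1fgg and 2hhj.
value = similarity_df.loc["1fgg", "2hhj"]
```

Labeling both axes with `items_list` is what makes the `.loc[row_item, column_item]` lookup possible.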

Answer 1
1 accepted 2016-02-18 19:15:36