Find the most similar sentences in Python

Question

Suggestions, reference links, and code are appreciated.

I have a dataset with more than 1500 rows. Each row contains a sentence. I am trying to find the best method for identifying the most similar sentences among them.

What I have tried

I have tried the K-means algorithm, which groups similar sentences into clusters. The drawback is that I have to pass K to create the clusters, and K is hard to guess. I tried the elbow method to estimate the number of clusters, but grouping everything is not sufficient: with this approach every row ends up in some cluster. I only want the rows whose similarity is above 0.90 to be returned, together with their IDs.
I tried cosine similarity: I used TfidfVectorizer to create a matrix and then passed it to cosine similarity. This approach did not work properly either.

What I am looking for

I want an approach where I can pass a threshold, for example 0.90, and get back all rows that are similar to each other above that threshold.

Data Sample
ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN   
11    | MAXPREDO Validation is corect
12    | Move to QC  
13    | Cancel ASN WMS Cancel ASN   
14    | MAXPREDO Validation is right
15    | Verify files are sent every hours for this interface from Optima
16    | MAXPREDO Validation are correct
17    | Move to QC  
18    | Verify files are not sent

Expected result

Rows from the data above that are similar to each other at the 0.90 threshold should be returned with their IDs:

ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN
13    | Cancel ASN WMS Cancel ASN
11    | MAXPREDO Validation is corect  # even spelling is not correct
14    | MAXPREDO Validation is right
16    | MAXPREDO Validation are correct
12    | Move to QC  
17    | Move to QC

Answer 1

Why did cosine similarity with the TF-IDF vectorizer not work for you?

I tried it, and it works with this code:

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A plain list of rows is enough here; np.matrix is not needed to build the frame.
df = pd.DataFrame(columns=["ID", "DESCRIPTION"], data=[
    [10, "Cancel ASN WMS Cancel ASN"],
    [11, "MAXPREDO Validation is corect"],
    [12, "Move to QC"],
    [13, "Cancel ASN WMS Cancel ASN"],
    [14, "MAXPREDO Validation is right"],
    [15, "Verify files are sent every hours for this interface from Optima"],
    [16, "MAXPREDO Validation are correct"],
    [17, "Move to QC"],
    [18, "Verify files are not sent"],
])

corpus = list(df["DESCRIPTION"].values)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

threshold = 0.4

for x in range(0,X.shape[0]):
  for y in range(x,X.shape[0]):
    if(x!=y):
      if(cosine_similarity(X[x],X[y])>threshold):
        print(df["ID"][x],":",corpus[x])
        print(df["ID"][y],":",corpus[y])
        print("Cosine similarity:",cosine_similarity(X[x],X[y]))
        print()

The threshold can be adjusted, but a threshold of 0.9 will not yield the results you want.

The output for a threshold of 0.4 is:

10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]

With a threshold of 0.39, all of your expected sentences appear in the output, but an additional pair with the IDs [15, 18] is returned as well:

10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

11 : MAXPREDO Validation is corect
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

14 : MAXPREDO Validation is right
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
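
The nested loop above calls cosine_similarity once per pair and prints pairs rather than groups. As a rough sketch (not part of the original answer), the full similarity matrix can be computed in one call and the rows linked above the threshold can be merged into groups of IDs. The helper name similar_groups and the union-find grouping are assumptions made for illustration. The character n-gram vectorizer at the end, with its threshold of 0.6, is a swapped-in variant that may tolerate misspellings such as "corect" better than whole-word TF-IDF; that threshold is untuned.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
corpus = [
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is corect",
    "Move to QC",
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is right",
    "Verify files are sent every hours for this interface from Optima",
    "MAXPREDO Validation are correct",
    "Move to QC",
    "Verify files are not sent",
]


def similar_groups(ids, texts, threshold, vectorizer=None):
    """Hypothetical helper: group IDs whose texts are linked by similarity > threshold."""
    vectorizer = vectorizer or TfidfVectorizer()
    # Full n x n cosine similarity matrix, computed once.
    sim = cosine_similarity(vectorizer.fit_transform(texts))

    # Union-find over row indices: link every pair above the threshold.
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Upper triangle only (k=1), so each pair is visited once and the diagonal is skipped.
    rows, cols = np.where(np.triu(sim, k=1) > threshold)
    for i, j in zip(rows.tolist(), cols.tolist()):
        parent[find(i)] = find(j)

    groups = {}
    for idx, item_id in enumerate(ids):
        groups.setdefault(find(idx), []).append(item_id)
    # Keep only groups with at least two members, i.e. rows that have a match.
    return sorted(g for g in groups.values() if len(g) > 1)


word_groups = similar_groups(ids, corpus, threshold=0.39)
print(word_groups)  # [[10, 13], [11, 14, 16], [12, 17], [15, 18]]

# Assumption: character n-grams inside word boundaries instead of whole words.
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
char_groups = similar_groups(ids, corpus, threshold=0.6, vectorizer=char_vectorizer)
print(char_groups)
```

Note that the grouping is transitive: if A matches B and B matches C, all three land in one group even when A and C alone score below the threshold. If only direct matches are wanted, the pairs from np.where can be returned as they are.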

Answer 2

One possible way is to use word embeddings to create vector representations of your sentences. For example, you can take pretrained word embeddings and let an RNN layer combine the word embeddings of each sentence into a single sentence vector. You can then calculate distances between these vectors. You still need to decide which threshold to set for accepting two sentences as similar, because the scale of word embeddings is not fixed.
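
To make the idea concrete, here is a minimal sketch that replaces the RNN layer with plain averaging of word vectors (mean pooling), which is a simpler stand-in and not the method described above. The three-dimensional vectors in toy_vectors are invented for illustration; a real setup would load pretrained embeddings instead. The helper names sentence_vector and cosine are also hypothetical.

```python
import numpy as np

# Toy 3-dimensional word vectors, invented for illustration only.
# A real setup would load pretrained embeddings instead.
toy_vectors = {
    "move": np.array([0.9, 0.1, 0.0]),
    "to": np.array([0.1, 0.1, 0.1]),
    "qc": np.array([0.0, 0.9, 0.2]),
    "validation": np.array([0.1, 0.2, 0.9]),
    "is": np.array([0.1, 0.1, 0.2]),
    "correct": np.array([0.2, 0.1, 0.8]),
    "right": np.array([0.3, 0.1, 0.7]),
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words in a sentence (mean pooling)."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        raise ValueError("no known words in sentence")
    return np.mean([vectors[w] for w in words], axis=0)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


s1 = sentence_vector("Validation is correct", toy_vectors)
s2 = sentence_vector("Validation is right", toy_vectors)
s3 = sentence_vector("Move to QC", toy_vectors)

print(cosine(s1, s2))  # sentences with related words score high
print(cosine(s1, s3))  # unrelated sentences score lower
```

With real embeddings, a misspelled word such as "corect" would usually be missing from the vocabulary and simply dropped by this sketch, so the threshold still has to be tuned on your own data.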

Update

I did some experiments. In my opinion this is a viable method for such a task, but you may want to check for yourself how well it works in your case. I created an example in my git repository.

The Word Mover's Distance algorithm can also be used for this task. You can find more information about this topic in this Medium article.

Answer 3

One can use this Python 3 library to compute sentence similarity: https://github.com/UKPLab/sentence-transformers

Code example from https://www.sbert.net/docs/usage/semantic_textual_similarity.html:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The library contains state-of-the-art sentence embedding models.

See https://stackoverflow.com/a/68728666/395857 to perform sentence clustering.
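
The example above compares two separate lists item by item. For the question's case of one list with IDs and a threshold, a sketch along the following lines may help. It assumes the sentence embeddings are available as a NumPy array with one row per sentence (for instance from model.encode on the DESCRIPTION column). The small stand-in vectors and the helper name pairs_above_threshold are made up so that the snippet runs without downloading a model.

```python
import numpy as np


def pairs_above_threshold(ids, embeddings, threshold):
    """Hypothetical helper: return (id_a, id_b, score) for each pair scoring above threshold."""
    emb = np.asarray(embeddings, dtype=float)
    # Normalise rows so that the dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = emb @ emb.T
    # Upper triangle only (k=1): each pair once, no self-comparisons.
    rows, cols = np.where(np.triu(scores, k=1) > threshold)
    return [
        (ids[i], ids[j], float(scores[i, j]))
        for i, j in zip(rows.tolist(), cols.tolist())
    ]


# Stand-in vectors; in practice these would be the model's sentence embeddings.
ids = [10, 12, 13, 17]
embeddings = np.array([
    [1.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.1],
    [0.05, 1.0, 0.0],
])

result = pairs_above_threshold(ids, embeddings, threshold=0.9)
for id_a, id_b, score in result:
    print(id_a, id_b, round(score, 4))
```

Whether 0.9 is a sensible cut-off depends on the embedding model, so it is worth inspecting a few scored pairs from your own data before fixing the threshold.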
