Check similarity of texts in pandas dataframe

I have a dataframe:

Account      Message
454232     Hi, first example 1
321342     Now, second example
412295     hello, a new example 1 in the third row
432325     And now something completely different

I would like to check the similarity between the texts in the Message column. I need to choose one of the messages as the source to test against (for example, the first one) and create a new column holding the output of the similarity test. If I had two lists, I would do the following:

import spacy
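# The 'en' shortcut was removed in spaCy 3; this assumes a model with word vectors is installed
spacyModel = spacy.load('en_core_web_md')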

list1 = ["Hi, first example 1"]
list2 = ["Now, second example","hello, a new example 1 in the third row","And now something completely different"]

list1SpacyDocs = [spacyModel(x) for x in list1]
list2SpacyDocs = [spacyModel(x) for x in list2]

similarityMatrix = [[x.similarity(y) for x in list1SpacyDocs] for y in list2SpacyDocs]

print(similarityMatrix)

But I do not know how to do the same in pandas, creating a new column with the similarity results.

Any suggestions?

I am not sure about spaCy, but to compare one text with the other values in the column I would use .apply(), pass it the matching function, and set axis=1 so that the function is called once per row. Here is an example using SequenceMatcher from the standard library's difflib module (I don't have spaCy available right now).

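from difflib import SequenceMatcher

test = 'Hi, first example 1'
# axis=1 passes each row to the lambda; ratio() returns a score between 0 and 1
df['r'] = df.apply(lambda x: SequenceMatcher(None, test, x.Message).ratio(), axis=1)
print(df)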

Result:

   Account                                  Message         r
0   454232                      Hi, first example 1  1.000000
1   321342                      Now, second example  0.578947
2   412295  hello, a new example 1 in the third row  0.413793
3   432325   And now something completely different  0.245614

So in your case it will be a similar statement, but using your own similarity function instead of SequenceMatcher.
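To make the "swap in your own function" step concrete, here is a minimal, self-contained sketch. The helper name add_similarity_column, the sequence_ratio wrapper and the output column name "similarity" are made up for illustration and do not come from the question. The runnable part uses SequenceMatcher so that it needs only pandas. The commented lines at the end show how spaCy's Doc.similarity could be plugged in, assuming a model with word vectors such as en_core_web_md is installed.

```python
from difflib import SequenceMatcher

import pandas as pd


def add_similarity_column(df, source_text, similarity_fn, column="similarity"):
    """Return a copy of df with similarity_fn(source_text, message) for each row.

    Hypothetical helper: similarity_fn is any callable that takes two strings
    and returns a number.
    """
    out = df.copy()
    # Series.apply calls the function once per value in the Message column
    out[column] = out["Message"].apply(lambda msg: similarity_fn(source_text, msg))
    return out


def sequence_ratio(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


df = pd.DataFrame(
    {
        "Account": [454232, 321342, 412295, 432325],
        "Message": [
            "Hi, first example 1",
            "Now, second example",
            "hello, a new example 1 in the third row",
            "And now something completely different",
        ],
    }
)

# Use the first message as the source to compare against
source = df["Message"].iloc[0]
result = add_similarity_column(df, source, sequence_ratio)
print(result)

# With spaCy instead (assumption: en_core_web_md is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_md")
#   source_doc = nlp(source)  # parse the source text only once
#   result = add_similarity_column(
#       df, source, lambda _, msg: source_doc.similarity(nlp(msg))
#   )
```

Passing the similarity function as an argument keeps the pandas part unchanged whichever library computes the score. Since only one column is needed, Series.apply on Message is enough here and axis=1 is not required.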
