
Cosine similarity of two columns in a DataFrame

I have a dataframe with 2 columns and I am trying to get a cosine similarity score for each pair of sentences.

Dataframe (df):

       A                   B
0    Lorem ipsum ta      lorem ipsum
1    Excepteur sint      occaecat excepteur
2    Duis aute irure     aute irure 

Some of the code snippets that I've tried are:

1. df["cosine_sim"] = df[["A","B"]].apply(lambda x1,x2:cosine_sim(x1,x2))

2. from spicy.spatial.distance import cosine
df["cosine_sim"] = df.apply(lambda row: 1 - cosine(row['A'], row['B']), axis = 1)

The above code didn't work, and I am still trying different approaches, but in the meantime I would appreciate any guidance. Thank you in advance!

Desired output:

       A                   B                     cosine_sim
0    Lorem ipsum ta      lorem ipsum                 0.8
1    Excepteur sint      occaecat excepteur          0.5
2    Duis aute irure     aute irure                  0.4

You need to first convert your sentences into vectors; this process is referred to as text vectorization. There are many ways to perform text vectorization, depending on the level of sophistication you require, what your corpus looks like, and the intended application. The simplest is the "Bag of Words" (BoW) approach, which I've implemented below. Once you understand what it means to represent a sentence as a vector, you can progress to other, more complex methods of representing lexical similarity. For example:

  • tf-idf, which weights words based on how frequently they occur across many documents (or sentences, in your case). You can think of this as a weighted BoW approach (see the sketch after this list).
  • BM25, which fixes a shortcoming of tf-idf in which a single mention of a word in a short document produces a high "relevance" score. It does this by taking the length of the document into account.
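
As a rough illustration of the tf-idf route, here is a minimal sketch using scikit-learn's TfidfVectorizer. The cosine computation is the same as in the BoW solution further down; only the vectorizer changes, and the sentence pairs are placeholders rather than your data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder sentence pairs (not the asker's data)
sent_a = ["lorem ipsum ta", "excepteur sint"]
sent_b = ["lorem ipsum", "occaecat excepteur"]

# Fit the vocabulary and idf weights on the full corpus (both columns together)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(sent_a + sent_b)

# Compare each (A[i], B[i]) pair via cosine similarity of its tf-idf vectors
n = len(sent_a)
for i in range(n):
    score = cosine_similarity(X_tfidf[i], X_tfidf[n + i])[0, 0]
    print(sent_a[i], "<->", sent_b[i], ":", round(score, 3))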

Advancing to measures of semantic similarity, you can employ methods such as Doc2Vec [1], which start to use "embedding spaces" to represent the semantics of text. Finally, recent methods like SentenceBERT [2] and Dense Passage Retrieval [3] employ techniques based on the Transformer (encoder-decoder) architecture [4] that allow "context-aware" representations to be formed.
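
If you want semantic rather than purely lexical similarity, the sketch below follows the SentenceBERT route. It assumes the sentence-transformers package is installed, and the model name "all-MiniLM-L6-v2" is just one commonly used choice, not a requirement.

from sentence_transformers import SentenceTransformer, util

# Load a pretrained sentence-embedding model (downloaded on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

a = ["He played the game as if his life depended on it."]
b = ["The external scars tell only part of the story."]

# encode() returns one embedding per sentence; cos_sim compares them pairwise
emb_a = model.encode(a)
emb_b = model.encode(b)
print(util.cos_sim(emb_a, emb_b))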

Solution

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from numpy.linalg import norm

df = pd.DataFrame({
    "A": [
    "I'm not a party animal, but I do like animal parties.",
    "That must be the tenth time I've been arrested for selling deep-fried cigars.",
    "He played the game as if his life depended on it and the truth was that it did."
    ],
    "B": [
    "The mysterious diary records the voice.",
    "She had the gift of being able to paint songs.",
    "The external scars tell only part of the story."
    ]
    })

# Combine all to make single corpus of text (i.e. list of sentences)
corpus = pd.concat([df["A"], df["B"]], axis=0, ignore_index=True).to_list()
# print(corpus)  # Display list of sentences

# Vectorization using basic Bag of Words (BoW) approach
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# print(vectorizer.get_feature_names_out())  # Display features
vect_sents = X.toarray()

cosine_sim_scores = []
# Iterate over each vectorised sentence in the A-B pairs from the original dataframe
for A_vect, B_vect in zip(vect_sents[:len(vect_sents) // 2], vect_sents[len(vect_sents) // 2:]):
    # Calculate cosine similarity and store result
    cosine_sim_scores.append(np.dot(A_vect, B_vect)/(norm(A_vect)*norm(B_vect)))
# Append results to original dataframe
df.insert(2, 'cosine_sim', cosine_sim_scores)
print(df)

Output

                                A                                         B  cosine_sim
0  I'm not a party animal, but...          The mysterious diary records ...    0.000000
1  That must be the tenth time...   She had the gift of being able to pa...    0.084515
2  He played the game as if hi...  The external scars tell only part of ...    0.257130
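
As a side note, the manual dot/norm computation above can be replaced by scikit-learn's pairwise helper. A minimal sketch reusing X and df from the solution, under the same assumption that the corpus lists all of column A first and then all of column B:

from sklearn.metrics.pairwise import cosine_similarity

n = len(df)
# Similarities between every A vector (rows 0..n-1) and every B vector (rows n..2n-1);
# the diagonal holds the score for each aligned (A[i], B[i]) pair
sims = cosine_similarity(X[:n], X[n:]).diagonal()
print(sims)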

References

[1] Le, Q. and Mikolov, T., 2014, June. Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188-1196). PMLR.

[2] Reimers, N. and Gurevych, I., 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

[3] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D. and Yih, W.T., 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.

[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
