
Calculate cosine similarity between all cases in a dataframe fast

I'm working on an NLP project where I have to compare the similarity between many sentences, e.g. from this dataframe:

[Image: sample dataframe of sentences]

The first thing I tried was to join the dataframe with itself to get the below format and compare row by row:

[Image: the dataframe joined with itself, one row per sentence pair]
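For reference, a minimal sketch of that kind of self-join (a cross join; this assumes pandas >= 1.2 for how="cross", and the toy column name is illustrative):

import pandas as pd

df = pd.DataFrame({"questions": ["q1", "q2", "q3"]})
# pair every question with every other question, including itself
joined = df.merge(df, how="cross", suffixes=("_a", "_b"))
print(len(joined))  # n rows become n**2 pairs: 3 -> 9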

The problem with this is that I quickly run out of memory for medium/big datasets: a 10k-row self-join yields 10,000² = 100M rows, which I cannot fit in RAM.

My current approach is to iterate over the dataframe as follows:

import copy
import pandas as pd
# cosine_similarity_numba is a numba-compiled cosine similarity helper
# defined elsewhere in my project

final = pd.DataFrame()

### for each row
for i in range(len(df_sample)):
    ### select the corresponding vector to compare with
    v = df_sample.loc[i, "use_vector"]
    ### compare all cases against the selected vector
    sims = df_sample.apply(lambda x: cosine_similarity_numba(x.use_vector, v), axis=1)
    ### keep the cases with a similarity over a given threshold, in this case 0.6
    temp = df_sample[sims > 0.6]
    ### filter out the base case
    temp = temp[~temp.index.isin([i])]
    temp["original_question"] = copy.copy(df_sample.loc[i, "questions"])
    ### append the result
    final = pd.concat([final, temp])

But this approach is not fast either. How can I improve the performance of this process?

One possible trick you may employ is to switch from a sparse tfidf representation to dense word embeddings from Facebook's fasttext:

import fasttext
# wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
# gunzip cc.en.300.bin.gz   # load_model expects the decompressed .bin file
model = fasttext.load_model("./cc.en.300.bin")

Then you can proceed to calculate cosine similarity with more space-efficient, context-aware and better-performing (?) dense word embeddings:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({"questions": ["This is a question",
                                 "This is a similar questin",
                                 "And this one is absolutely different"]})

df["vecs"] = df["questions"].apply(model.get_sentence_vector)

# only pairwise distances with itself: vectorized, no doubling of data
# note that pdist returns cosine *distance*; similarity = 1 - distance
out = pdist(np.stack(df["vecs"]), metric="cosine")
cosine_distance = squareform(out)
print(cosine_distance)

[[0.         0.08294727 0.25305626]
 [0.08294727 0.         0.23575631]
 [0.25305626 0.23575631 0.        ]]

Note as well that, on top of the memory efficiency, you also gain about a 10x speed increase thanks to using the vectorized cosine distance from scipy.
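If you want to sanity-check that speed-up on your own data, here is a minimal benchmark sketch; the sizes and the naive loop are made up for illustration, and the actual factor depends on your hardware and vector dimensionality:

import timeit

import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(1000, 300).astype(np.float32)  # e.g. 1k sentences, 300-dim vectors

def loop_cosine(X):
    # naive row-by-row comparison, similar in spirit to the apply-based loop above
    n = X.shape[0]
    norms = np.linalg.norm(X, axis=1)
    out = np.empty((n, n), dtype=X.dtype)
    for i in range(n):
        out[i] = X @ X[i] / (norms * norms[i])
    return out

print("loop :", timeit.timeit(lambda: loop_cosine(X), number=3))
print("pdist:", timeit.timeit(lambda: pdist(X, metric="cosine"), number=3))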

Another possible trick is to cast your similarity vectors from the default float64 to float32 or float16:

df["vecs"] = df["vecs"].apply(np.float16)

which will give you both speed and memory gains.
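As a rough illustration of the memory side (a sketch with a made-up 300-dimensional vector, the size the cc.en.300 model produces):

import numpy as np

v64 = np.random.rand(300)       # default float64
v16 = v64.astype(np.float16)
print(v64.nbytes, v16.nbytes)   # 2400 vs 600 bytes per vector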

I just wrote an answer yesterday to a problem similar to yours: Top-K Cosine Similarity rows in a dataframe of pandas.

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = {"use_vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
df = pd.DataFrame(data)
print("Data: \n{}\n".format(df))

A = np.array(df["use_vector"].tolist())
vectors_num = len(A)
# Get similarities matrix; the value for each pair sits at the corresponding
# index of the upper triangle of the matrix
similarities = cosine_similarity(A)
# Set symmetrical (repetitive) and diagonal (similarity to self) entries to -2
similarities[np.tril_indices(vectors_num)] = -2
print("Similarities: \n{}\n".format(similarities))

Outputs:

Data: 
          use_vector
0  [-0.1, -0.2, 0.3]
1  [0.1, -0.2, -0.3]
2  [-0.1, 0.2, -0.3]

Similarities:
[[-2.         -0.42857143 -0.85714286]  # vector 0 & 1, 2
 [-2.         -2.          0.28571429]  # vector 1 & 2
 [-2.         -2.         -2.        ]]
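A possible follow-up sketch: recover the index pairs above a similarity threshold from that masked matrix. The question used 0.6; for this toy data only the pair (1, 2) clears a lower threshold of 0.2:

# keep pairs whose similarity exceeds a given threshold
for i, j in np.argwhere(similarities > 0.2):
    print(i, j, similarities[i, j])
# prints: 1 2 0.285714...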
