Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

Question

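I am trying to compute the cosine similarity of tf-idf vectors between two columns of a Pandas data frame. One column contains search queries, the other contains product titles. The cosine similarity value is meant to be a "feature" for a search engine / ranking machine learning algorithm.

I am doing this in an iPython notebook and am unfortunately running into MemoryErrors. After a few hours of digging I am still not sure why.

My setup:

Lenovo E560 laptop
Core i7-6500U @ 2.50 GHz
16 GB RAM
Windows 10
Using the Anaconda 3.5 kernel with all libraries freshly updated

I tested my code and goal on a small toy dataset, following a similar Stack Overflow question: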

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial

clf = TfidfVectorizer()

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

df = pd.DataFrame(data={'a':a, 'b':b})

clf.fit(df['a'] + " " + df['b'])

tfidf_a = clf.transform(df['a']).todense()
tfidf_b = clf.transform(df['b']).todense()

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]

df['tfidf_cosine_similarity'] = row_similarities

print(df)

This gives the following (good!) output:

                   a                 b  tfidf_cosine_similarity
0         hello world        my name is                 0.000000
1          my name is       hello world                 0.000000
2  what is your name?  my name is what?                 0.725628
3      max cosine sim    max cosine sim                 1.000000

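However, when I try to apply the same approach to a data frame (df_all_export) of dimensions 186,154 x 5, where 2 of the 5 columns hold the queries (search_term) and the documents (product_title), like so: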

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])

tfidf_a = clf.transform(df_all_export['search_term']).todense()
tfidf_b = clf.transform(df_all_export['product_title']).todense()

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]
df_all_export['tfidf_cosine_similarity'] = row_similarities

df_all_export.head()

I get the following (the full error is not shown here, but you get the idea):

MemoryError                               Traceback (most recent call last)
<ipython-input-27-8308fcfa8f9f> in <module>()
     12 clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
     13 
---> 14 tfidf_a = clf.transform(df_all_export['search_term']).todense()
     15 tfidf_b = clf.transform(df_all_export['product_title']).todense()
     16
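A back-of-the-envelope sketch of why the `.todense()` call on line 14 is the likely culprit: a dense matrix stores every cell, zeros included, at 8 bytes per float64 value. The vocabulary size of the real data is not stated in the question, so the 20,000 below is purely an assumed figure for illustration, and `dense_gib` is a hypothetical helper, not part of any library.

```python
def dense_gib(n_rows, n_cols, itemsize=8):
    """GiB needed to hold an n_rows x n_cols dense array (float64 by default)."""
    return n_rows * n_cols * itemsize / 2**30


n_rows = 186154         # row count stated in the question
assumed_vocab = 20000   # ASSUMPTION: the real vocabulary size is not given

print(f"{dense_gib(n_rows, assumed_vocab):.1f} GiB per dense matrix")
```

Under that assumption a single dense matrix would already exceed the 16 GB of RAM on the machine, and the code builds two of them.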

I am absolutely lost on this one, but I fear the solution will turn out to be very simple and elegant :)

Thanks in advance!

Answer 1

You can still work with sparse matrices / arrays using the methods in sklearn.metrics.pairwise:

# I've executed your example up to (including):
# ...
clf.fit(df['a'] + " " + df['b'])

A = clf.transform(df['a'])

B = clf.transform(df['b'])

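# Import only the two functions used below instead of a wildcard import
from sklearn.metrics.pairwise import paired_cosine_distances, cosine_similarity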

paired_cosine_distances shows how far apart, or how different, your strings are (it compares the values in the two columns row by row).

0 means a perfect match.

In [136]: paired_cosine_distances(A, B)
Out[136]: array([ 1.        ,  1.        ,  0.27437247,  0.        ])

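cosine_similarity compares the first string of column a with all strings in column b (row 1), the second string of column a with all strings in column b (row 2), and so on: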

In [137]: cosine_similarity(A, B)
Out[137]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.74162106,  0.        ],
       [ 0.43929881,  0.        ,  0.72562753,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])
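To relate the two functions: the diagonal of the all-pairs matrix should equal one minus the paired distances, since both compare row i of column a with row i of column b. The sketch below rebuilds the toy data from the question to check this; it fits the vectorizer on plain Python lists rather than on the data frame columns, which is a simplification made here and not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

clf = TfidfVectorizer()
clf.fit([x + " " + y for x, y in zip(a, b)])
A = clf.transform(a)  # sparse, one row per string in a
B = clf.transform(b)  # sparse, one row per string in b

all_pairs = cosine_similarity(A, B)            # dense array, shape (4, 4)
row_wise = 1 - paired_cosine_distances(A, B)   # 1-D array, shape (4,)

# Row i against row i is exactly the diagonal of the all-pairs matrix.
print(np.allclose(np.diag(all_pairs), row_wise))
```

For the row-by-row feature in the question, the paired function is the better fit: the all-pairs result has one entry per (query, title) combination, which for 186,154 rows would be roughly 3.5e10 values.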

In [141]: A
Out[141]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>

In [142]: B
Out[142]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>

NOTE: all calculations were done using sparse matrices; we never uncompressed them in memory!
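As a further illustration of staying sparse (an alternative sketch, not the method used in this answer): TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the cosine similarity of two rows reduces to their dot product, which can be computed with an element-wise sparse multiply followed by a row sum.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

clf = TfidfVectorizer()  # default norm='l2', so every non-empty row has unit length
clf.fit([x + " " + y for x, y in zip(a, b)])
A = clf.transform(a)
B = clf.transform(b)

# Element-wise product stays sparse; summing each row gives the dot product.
dot = np.asarray(A.multiply(B).sum(axis=1)).ravel()

print(np.allclose(dot, 1 - paired_cosine_distances(A, B)))
```

This relies on the default normalization; with norm=None the dot product would no longer be a cosine. Rows that contain no vocabulary tokens at all (all-zero vectors) may be scored differently by the two approaches, so it is worth checking such rows separately.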

Answer 2

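With the kind help and solution posted by MaxU above, here is the full code that accomplishes the task I was trying to achieve. Besides avoiding the MemoryError, it also avoids the odd NaNs that showed up in the cosine similarity calculations when I tried some "hacky" workarounds.

Note that the code below is a partial snippet, in the sense that the large data frame df_all_export, of dimensions 186,134 x 5, has already been constructed in the full code.

I hope this helps others who are trying to compute cosine similarity using tf-idf vectors between search queries and matched documents. For such a common "problem", I had a hard time finding a clear solution implemented with SKLearn and Pandas.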

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances as pcd

clf = TfidfVectorizer()

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])

A = clf.transform(df_all_export['search_term'])
B = clf.transform(df_all_export['product_title'])

cosine = 1 - pcd(A, B)

df_all_export['tfidf_cosine'] = cosine
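For reuse, the snippet above could be wrapped in a small function and checked against the toy data from the question. This is a sketch only: `add_tfidf_cosine` and its parameter names are hypothetical, and treating missing text as an empty string is an assumption added here that does not appear in the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances


def add_tfidf_cosine(df, query_col, doc_col, out_col="tfidf_cosine"):
    """Return a copy of df with a row-wise tf-idf cosine similarity column."""
    # ASSUMPTION: missing text is treated as an empty string.
    queries = df[query_col].fillna("").astype(str)
    docs = df[doc_col].fillna("").astype(str)

    vectorizer = TfidfVectorizer()
    vectorizer.fit(queries + " " + docs)

    # Both matrices stay sparse; no .todense() call is needed.
    A = vectorizer.transform(queries)
    B = vectorizer.transform(docs)

    out = df.copy()
    out[out_col] = 1 - paired_cosine_distances(A, B)
    return out


toy = pd.DataFrame({
    "a": ['hello world', 'my name is', 'what is your name?', 'max cosine sim'],
    "b": ['my name is', 'hello world', 'my name is what?', 'max cosine sim'],
})
result = add_tfidf_cosine(toy, "a", "b")
print(result)
```

On the toy data the new column should match the similarities shown in the question, while the original data frame is left unmodified.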
