
Memory Error TFIDF cosine similarity in Python

There's a large dataset of item descriptions. It contains item IDs and a text description for each item. One can build a cosine similarity matrix from the tf-idf values of the terms in the descriptions.

My dataset contains descriptions for 300336 items. I get a MemoryError when I try to execute my Python code:

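from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# df is assumed to be a pandas DataFrame with the descriptions in df.text
tf = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 1),
                     min_df=1)  # keep every term that occurs at least once
tfidf_mx = tf.fit_transform(df.text)
# Builds the full n x n similarity matrix; the MemoryError is raised here
cosine_similarities = linear_kernel(tfidf_mx)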

I also tried another way:

sim_mx = cosine_similarity(tfidf_mx, dense_output=False)

But it gives me a MemoryError too.

Maybe there is an upper limit on cosine similarity computation, even for sparse matrices?

Do you know why the MemoryError occurs and how to deal with it?

The MemoryError occurs because the output is (a) ridiculously large and (b) dense, regardless of whether it's stored in a sparse or dense matrix.

(a) If the input contains n items, there are n * (n - 1) similarities to compute and return. (Since sim(i, j) = sim(j, i), there are really just n * (n - 1) / 2 distinct similarities, but the matrix lists each one twice.) With 300336 items, the resulting matrix will contain about 90 billion entries. At 8 bytes per entry, that's roughly 720 GB of space.
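The arithmetic in (a) can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name dense_similarity_bytes is made up for illustration, and the 8 bytes per entry assumes float64 values.

```python
def dense_similarity_bytes(n_items, bytes_per_entry=8):
    """Bytes needed for a dense n_items x n_items matrix (float64 by default)."""
    return n_items * n_items * bytes_per_entry


n = 300336
entries = n * n                  # every cell of the matrix, diagonal included
unique_pairs = n * (n - 1) // 2  # distinct unordered pairs (i, j) with i != j
size_gb = dense_similarity_bytes(n) / 1e9

print(f"{entries:,} entries, {unique_pairs:,} unique pairs, about {size_gb:.0f} GB")
```

Storing only the unique pairs would halve the figure, which is still far more than typical RAM.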

(b) If most of these entries were 0, then a sparse matrix would save space. But often that's not the case with similarity scores. Cosine(i, j) will be 0, for example, only for pairs of items that have no words in common.
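Since the full matrix will not fit in memory, one possible workaround (a sketch, not something the answer above prescribes) is to compute the similarities one block of rows at a time and keep only the k most similar items for each row. The function name top_k_similar and its parameters k and chunk_size are hypothetical; the sketch assumes the rows are L2-normalized, which is TfidfVectorizer's default, so that linear_kernel gives cosine similarities.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_k_similar(tfidf_mx, k=10, chunk_size=100):
    """Return (indices, scores) of the k most similar other rows for each row.

    Only one chunk_size x n block of similarities is in memory at a time.
    """
    n = tfidf_mx.shape[0]
    if n < 2:
        raise ValueError("need at least two items to compare")
    k = min(k, n - 1)
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Dense block of similarities between rows start..stop and all rows.
        block = linear_kernel(tfidf_mx[start:stop], tfidf_mx)
        # Mask each item's similarity with itself so it is never selected.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # argpartition picks the k largest per row without a full sort.
        part = np.argpartition(block, -k, axis=1)[:, -k:]
        part_sim = np.take_along_axis(block, part, axis=1)
        # Order the k candidates from most to least similar.
        order = np.argsort(-part_sim, axis=1)
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sim, order, axis=1)
    return top_idx, top_sim


# Tiny made-up corpus, just to exercise the function.
docs = ["red apple fruit", "green apple fruit", "fast red car", "slow green car"]
tfidf = TfidfVectorizer().fit_transform(docs)
idx, sim = top_k_similar(tfidf, k=2, chunk_size=2)
print(idx)
print(sim)
```

The result takes n * k entries instead of n * n. Peak memory for the block is roughly chunk_size * n * 8 bytes, so with 300336 items a chunk of 100 rows needs about 240 MB; lower chunk_size if that is too much.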
