
I can't obtain the cosine similarity of my tf-idf matrix because Google Colaboratory gives me a memory RAM error

Could anyone help me, please? I've been trying for days to solve a problem with the Google Colaboratory workspace: an error about the available RAM when computing the pairwise similarity scores of a TF-IDF matrix by means of cosine similarity and other metrics. I can't fix it, and it is really important for me, because it is a piece of code I must write for my TFG (final degree project), a content-based recommendation system that returns the 10 films most similar to a given one using the films' descriptions (this is why I have to use the TF-IDF matrix: to convert the text or "document" of each film in the database into a numeric format that my machine learning algorithm can understand).

First, I load the .csv file containing the films, with the description of each one in the 'overview' column of the pandas DataFrame (here is the code for simplicity):

import pandas as pd
metadata = pd.read_csv("movies_metadata.csv", low_memory = False)
metadata.loc[:, ["original_title","overview"]].head()

Then, I want to obtain the TF-IDF matrix of those film descriptions in 'overview'. To do so, I run this piece of code:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words = "english")  # the built-in stop list must be the lowercase string "english"
metadata["overview"] = metadata["overview"].fillna("")
tfidf_matrix = tfidf.fit_transform(metadata["overview"])
tfidf_matrix.shape

As you can see, the matrix (a sparse matrix) is obtained, and it is a huge one: its dimensions are 45466 rows by 75827 columns. I think this is exactly the reason for my problem.
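Curiously, the sparse matrix itself is not that heavy, since only its nonzero entries are stored; a quick check of the CSR buffers (using the tfidf_matrix built above) shows the input fits in memory with room to spare:

# only the nonzero entries of the sparse matrix are actually stored
mb = (tfidf_matrix.data.nbytes + tfidf_matrix.indices.nbytes + tfidf_matrix.indptr.nbytes) / 1024**2
print(mb)   # tens of megabytes, not gigabytes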

When I compute the pairwise cosine similarity scores between films, I should obtain a (45466, 45466) matrix; but instead, Google Colaboratory gives me an error saying something like "your session has failed because you have used all the available RAM" when I run the following code:

from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
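(For reference: TfidfVectorizer L2-normalizes its rows by default, so linear_kernel here is exactly the cosine similarity. A quick sanity check on a small slice of the matrix, assuming the tfidf_matrix from above, confirms this without exhausting the RAM:)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sample = tfidf_matrix[:100]   # a small slice fits in memory easily
assert np.allclose(linear_kernel(sample, sample), cosine_similarity(sample))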

I don't know how to solve it... But what I do know is that I really need those cosine similarities between films for my study.

In the DataCamp tutorial I've been following to learn about this machine learning topic, you can see that they obtain the cosine_sim matrix without any problem... Why can't it work the same way for me?

I've also been trying other lines of code, such as:

from sklearn.metrics.pairwise import pairwise_distances
cosine_sim = 1 - pairwise_distances(tfidf_matrix, metric = "cosine")  # returns distances, i.e. 1 - similarity

I even tried to enlarge the RAM of Google Colab with a snippet I found on the Internet, which deliberately fills up the RAM until the session crashes (supposedly Colab then offers a runtime with more memory):

a = []
while(1):
    a.append('1')

but that didn't work either. I hope someone can give me a solution...

Thank you very much for your attention, and sorry for the inconvenience!

Though it's too late, this might be helpful to someone else, as I landed here yesterday.

Issue: this is a memory problem; the computation needs around 15 GB+ of memory, which is not available in the system. Why does everything go smoothly in the DataCamp tutorial? They might be using the cloud, or an upgraded system; on a normal system this issue can occur.
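A rough back-of-envelope estimate (mine, to make that figure concrete) of why the dense result alone needs that much memory:

n = 45466                       # number of films, i.e. rows of the tf-idf matrix
bytes_needed = n * n * 8        # the dense result is float64, 8 bytes per entry
print(bytes_needed / 1024**3)   # ~15.4 GiB, before counting any intermediates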

Solution: reduce the size of the data while following the same tutorial; just add a line to keep only the top N rows and do your experiments. This solution is for learning and understanding the concepts; otherwise, you have to upgrade the system or follow some other technique. Add the following lines right after reading the csv:

N = 300   # any number of rows that fits in your system's memory
metadata = metadata.head(N)

That's all; you will not see the memory error any more.
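If you eventually need all 45466 films rather than a sample, one of those "other techniques" is to compute the similarities in row batches and keep only the 10 best matches per film, so the full 45466 x 45466 matrix never exists in memory at once. A minimal sketch (my own suggestion, not from the tutorial; the function name, k and batch_size are illustrative, and it assumes the tfidf_matrix from the question):

import numpy as np

def top_k_similar(tfidf_matrix, k = 10, batch_size = 1000):
    n = tfidf_matrix.shape[0]
    top_idx = np.zeros((n, k), dtype = np.int64)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # one dense (batch_size, n) block at a time: a few hundred MB, not ~15 GB
        sims = (tfidf_matrix[start:end] @ tfidf_matrix.T).toarray()
        for i in range(end - start):
            sims[i, start + i] = -1.0   # exclude each film's similarity to itself
        part = np.argpartition(-sims, k, axis = 1)[:, :k]            # unsorted top-k
        order = np.argsort(-np.take_along_axis(sims, part, axis = 1), axis = 1)
        top_idx[start:end] = np.take_along_axis(part, order, axis = 1)
    return top_idx

top10 = top_k_similar(tfidf_matrix)
# e.g. titles of the 10 films most similar to film 0:
# metadata["original_title"].iloc[top10[0]]

Since TfidfVectorizer L2-normalizes its rows, the sparse dot product in the loop is exactly the cosine similarity; if a batch is still too large for your RAM, lower batch_size to trade speed for memory.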
