
I can't obtain the cosine similarity of my TF-IDF matrix because Google Colaboratory gives me a RAM error

Could anyone help me, please? I've been trying for days to solve a problem in Google Colaboratory: the session runs out of RAM while computing the similarity scores of a TF-IDF matrix with cosine similarity (and other metrics). I can't fix it, and it really matters to me because this is code I must write for my TFG (final degree project), a content-based recommendation system that returns the 10 films most similar to a given one, using the films' descriptions. That is why I need the TF-IDF matrix: to convert the text "document" of each film in the database into a numeric format that my machine learning algorithm can understand.

First, I load the .csv file containing the films, with each film's description in the 'overview' column of the pandas DataFrame (I include the code for clarity):

import pandas as pd

# Load the films dataset; each film's plot description is in "overview".
metadata = pd.read_csv("movies_metadata.csv", low_memory = False)
metadata.loc[:, ["original_title","overview"]].head()
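Before vectorizing, it may be worth checking how many overviews are missing (just a quick look, this is why the fillna below is needed):

# Count films whose overview is missing; these become empty strings later.
metadata["overview"].isna().sum()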

Then I want to obtain the TF-IDF matrix of the films' descriptions in 'overview'. To do so, I run this piece of code:

from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words must be the lowercase string "english"; "English" raises a ValueError.
tfidf = TfidfVectorizer(stop_words = "english")
# Replace missing overviews with empty strings so the vectorizer doesn't choke on NaN.
metadata["overview"] = metadata["overview"].fillna("")
tfidf_matrix = tfidf.fit_transform(metadata["overview"])
tfidf_matrix.shape

As you can see, the matrix (a sparse one) is obtained and it is huge: 45466 rows by 75827 columns. I think this is precisely the cause of my problem.
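A quick back-of-envelope check (assuming the similarity matrix would come back as dense float64, NumPy's default) shows the scale of the problem:

import numpy as np

n_films = tfidf_matrix.shape[0]  # 45466

# The sparse TF-IDF matrix only stores its non-zero entries...
sparse_mb = (tfidf_matrix.data.nbytes
             + tfidf_matrix.indices.nbytes
             + tfidf_matrix.indptr.nbytes) / 1e6

# ...but the pairwise similarity matrix is dense: n_films x n_films values.
dense_gb = n_films ** 2 * 8 / 1e9

print(f"sparse TF-IDF: ~{sparse_mb:.0f} MB")
print(f"dense similarity matrix: ~{dense_gb:.1f} GB")  # about 16.5 GB

That is more than the RAM a standard Colab runtime provides (around 12 GB, if I recall correctly), which would explain the crash.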

When I compute the pairwise cosine similarity scores between films, I should obtain a (45466, 45466) matrix; but instead, Google Colaboratory gives me an error saying something like "your session has failed because you have used all the available RAM" when I run the following code:

from sklearn.metrics.pairwise import linear_kernel

# TfidfVectorizer L2-normalizes each row by default, so this plain dot
# product (linear kernel) is exactly the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

I don't know how to solve it, but I do know that I really need those cosine similarities between films for my study.

In the DataCamp tutorial I've been following to learn about this machine learning topic, they obtain the cosine_sim matrix without problems. Why can't it be the same for me?

I've also tried other lines of code, like:

from sklearn.metrics.pairwise import pairwise_distances

# Aliasing this as "pd" would shadow pandas, so import it under its own name.
# Note that metric = "cosine" returns cosine distances, not similarities.
cosine_dist = pairwise_distances(tfidf_matrix, tfidf_matrix, metric = "cosine")
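Even if that call fit in memory, the result would still need converting, since cosine distance is defined as one minus cosine similarity:

# Cosine distance = 1 - cosine similarity, so flip the distances back:
cosine_sim = 1 - cosine_dist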

and I even tried to enlarge the RAM of Google Colab with a snippet I found on the Internet:

# Deliberately exhaust the RAM so the session crashes and Colab
# (supposedly) offers a higher-memory runtime afterwards.
a = []
while(1):
    a.append('1')

but it didn't work either. I hope someone can give me a solution...

Thank you very much for your attention and sorry for the inconvenience!

Though it's late, this might be helpful to someone else, as I landed here yesterday.

Issue: This is a memory problem. Computing the full similarity matrix requires around 15 GB+ of RAM, which a standard system does not have. Why does everything run smoothly in the DataCamp tutorial? They are probably using a cloud or upgraded machine; on a normal system this error is expected.

Solution: Reduce the size of the data. Follow the same tutorial, but keep only the top N rows and run your experiments on that subset. This workaround is for learning and understanding the concepts; otherwise you have to upgrade the system or use another technique. Add the following lines right after reading the CSV:

N = 300  # any number of rows that fits in your system's memory
metadata = metadata.head(N)

That's all; you will not see the memory error any more.
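If you do need all 45466 films, a different approach (not from the tutorial; just a minimal sketch that assumes the tfidf_matrix from the question) is to compute the similarities one chunk of rows at a time and keep only the 10 best neighbours per film, so the full dense matrix never exists in memory:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel

top_k = 10          # the 10 most similar films we actually want
chunk_size = 1000   # assumption: tune this to your RAM
n = tfidf_matrix.shape[0]

top_indices = np.zeros((n, top_k), dtype=np.int32)
top_scores = np.zeros((n, top_k), dtype=np.float32)

for start in range(0, n, chunk_size):
    stop = min(start + chunk_size, n)
    # Dense block of shape (chunk, n): only a few hundred MB per chunk.
    sims = linear_kernel(tfidf_matrix[start:stop], tfidf_matrix)
    # A film should not be its own nearest neighbour.
    for i in range(start, stop):
        sims[i - start, i] = -1.0
    # argpartition finds the top_k largest per row without a full sort...
    idx = np.argpartition(sims, -top_k, axis=1)[:, -top_k:]
    vals = np.take_along_axis(sims, idx, axis=1)
    # ...then a small argsort orders just those top_k by descending score.
    order = np.argsort(-vals, axis=1)
    top_indices[start:stop] = np.take_along_axis(idx, order, axis=1)
    top_scores[start:stop] = np.take_along_axis(vals, order, axis=1)

Each chunk holds only chunk_size x 45466 floats (roughly 360 MB at chunk_size = 1000), so it should stay well inside Colab's memory; chunk_size is the knob to lower if it still crashes.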


 