
Memory error while converting a sparse matrix to an array with column names (the array is the input to a training model)

My training data consists of 5 million rows of product descriptions with an average length of 10 words. I can use either CountVectorizer or TF-IDF to transform my input feature. However, after transforming the feature into a sparse matrix, I constantly get a memory error when converting it to an array or dense matrix. CountVectorizer returns ~130k token columns. Please note that the system I am working on has 512 GB of memory. The two methods I am trying to implement are shown further down. This is the error:

return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
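For scale, here is a rough estimate of what that np.zeros call is being asked to allocate. It is only a back-of-the-envelope sketch: the row and column counts are the approximate figures from the question, the helper name dense_size_gb is made up for illustration, and the 8-byte element size assumes the vectorizers' default dtypes (int64 for CountVectorizer, float64 for TfidfVectorizer).

```python
# Rough size of the dense array that .toarray() / .todense() tries to allocate.
# Row and column counts are the approximate figures quoted in the question.
n_rows = 5_000_000   # product descriptions
n_cols = 130_000     # token columns after vectorizing


def dense_size_gb(rows, cols, bytes_per_value):
    """Size in GB (10**9 bytes) of a dense rows x cols array (hypothetical helper)."""
    return rows * cols * bytes_per_value / 1e9


size_8_byte = dense_size_gb(n_rows, n_cols, 8)  # int64 / float64 elements
size_4_byte = dense_size_gb(n_rows, n_cols, 4)  # float32 / int32 elements

print(f"8-byte elements: {size_8_byte:,.0f} GB")  # 5,200 GB
print(f"4-byte elements: {size_4_byte:,.0f} GB")  # 2,600 GB
```

Under these assumptions, either figure is several times larger than 512 GB, so the dense array cannot fit regardless of which vectorizer produced the sparse matrix.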

Method 1

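import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams that occur in at least 20 documents
vect1 = CountVectorizer(ngram_range=(1, 2), min_df=20)

train_dtm1 = vect1.fit_transform(train_data)  # sparse document-term matrix

# Converting the sparse matrix to a dense array is the step that raises MemoryError
dtm_data = pd.DataFrame(train_dtm1.toarray(), columns=vect1.get_feature_names())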

Method 2

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)

corpus = tfidf.fit_transform(train_data)  # sparse TF-IDF matrix

dtm_data = pd.DataFrame(corpus.todense(), columns=tfidf.get_feature_names())

dtm_data then goes into a train-test split, and the result is fed to a Keras ANN. How can I resolve this memory issue?

A MemoryError is raised when Python requests more memory than is available. In addition to your system memory, check your graphics card memory if you are using tensorflow-gpu. You might also want to take a look at Google Colab, which runs Python programs in the cloud.
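One alternative to densifying the whole matrix, sketched here under assumptions rather than taken from the question: keep the vectorizer output sparse and convert only one mini-batch at a time to a dense array. The generator name sparse_batches, the batch size, and the small random stand-in matrix are all hypothetical; the real input would be the sparse matrix returned by fit_transform and the matching label array.

```python
import numpy as np
from scipy import sparse


def sparse_batches(X, y, batch_size=1024, seed=0):
    """Yield (dense_batch, labels) forever; only one batch is dense at a time.

    Hypothetical helper: X is a SciPy CSR matrix, y is a NumPy array.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    while True:
        order = rng.permutation(n)  # reshuffle the rows at the start of each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Densify just this slice of rows, as float32 to halve the footprint
            yield X[idx].toarray().astype(np.float32), y[idx]


# Small stand-in for the real document-term matrix (sizes are made up).
rng = np.random.default_rng(0)
X = sparse.csr_matrix((rng.random((1000, 500)) < 0.01).astype(np.float64))
y = rng.integers(0, 2, size=1000)

gen = sparse_batches(X, y, batch_size=128)
xb, yb = next(gen)
print(xb.shape, yb.shape)  # (128, 500) (128,)
```

Because the generator never stops, a Keras model.fit call that consumes it would need steps_per_epoch set to the number of batches per epoch (here, the row count divided by the batch size, rounded up). scikit-learn's train_test_split accepts SciPy sparse matrices, so the split can also be done without building a DataFrame first; the trade-off is that the column names are no longer attached to the data.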

