CountVectorizer().fit in scikit-learn Python gives Memory error

I am working on an 8-class classification problem. The training set contains around 400,000 labeled entities. I am using CountVectorizer.fit() to vectorize the data, but I am getting a MemoryError. I tried using HashingVectorizer instead, but in vain.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the Dataset
vect = CountVectorizer()
vect.fit(X_train.values.astype('U'))  # cast to Unicode so NaN/non-string entries do not break the vectorizer
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))

You can set max_features, which limits the memory usage of the vocabulary. The right value really depends on the task, so you should treat it as a hyperparameter and tune it (a hedged tuning sketch follows the code below). In NLP (English), people usually use ~10,000 as the vocabulary size. You can also do the same with HashingVectorizer, but you risk hash collisions, which cause multiple distinct words to increment the same counter (see the second sketch below).

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the Dataset
vect = CountVectorizer(max_features=10000)  # keep only the 10,000 most frequent terms
vect.fit(X_train.values.astype('U'))
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))
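
A hedged sketch of treating max_features as a hyperparameter, as suggested above. The Pipeline, GridSearchCV, MultinomialNB classifier, and candidate values are illustrative assumptions, not part of the original answer:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Grid-search over a few vocabulary sizes; the candidates are assumptions
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
param_grid = {'vect__max_features': [5000, 10000, 20000]}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train.values.astype('U'), y_train)
print(grid.best_params_)

And a minimal sketch of the HashingVectorizer alternative mentioned above; n_features=2**18 is an illustrative choice, not a recommendation (smaller values save memory but raise the chance of hash collisions):

from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless: no vocabulary is built or held in memory,
# so no fit() pass over the corpus is needed
hvect = HashingVectorizer(n_features=2**18, alternate_sign=False)
X_train_dtm = hvect.transform(X_train.values.astype('U'))
X_test_dtm = hvect.transform(X_test.values.astype('U'))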
