Sklearn: How to apply dimensionality reduction on a huge data set?

Problem: An OutOfMemory error occurs when applying PCA to 8 million features.

Here is my code snippet:

from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=10000)
pca_tfidf_sklearn = sklearn_pca.fit(traindata_tfidf.toarray())

I want to apply PCA or another dimensionality reduction technique to features extracted from text (using tf-idf). Currently I have around 8 million such features. I want to reduce them, and I am using MultinomialNB to classify the documents.

I am stuck because of the OutOfMemory error.

I have had a similar problem. Using a Restricted Boltzmann Machine (RBM) instead of PCA fixed it. Mathematically, this is because PCA only looks at the eigenvalues and eigenvectors of your feature matrix, whereas an RBM works as a neural network that considers all multiplicative combinations of the features in your data. The RBM therefore has a much larger set to draw on when deciding which features are more important, and it can reduce the features to a much smaller set of more important ones than PCA can. However, be sure to feature-scale and normalize the data before applying an RBM.
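As a rough illustration of the scale-then-RBM idea, here is a minimal sketch using scikit-learn's `BernoulliRBM`. The answer does not say which RBM implementation was used, so this choice is an assumption, as are the random stand-in matrix, its sizes, and the hyperparameters. `MaxAbsScaler` is used for the scaling step because it works on sparse input without densifying it.

```python
import numpy as np
from scipy import sparse
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical stand-in for a tf-idf matrix: 100 documents, 500 features.
rng = np.random.RandomState(0)
X = sparse.random(100, 500, density=0.05, format="csr", random_state=rng)

# Scale every feature into [0, 1]; the matrix stays sparse.
X_scaled = MaxAbsScaler().fit_transform(X)

# Learn 20 hidden units (hyperparameters are illustrative, not tuned).
rbm = BernoulliRBM(n_components=20, learning_rate=0.05, n_iter=5, random_state=0)
hidden = rbm.fit_transform(X_scaled)

print(hidden.shape)
```

The transformed values are hidden-unit activation probabilities, so they are non-negative, which matters if the result is fed to a classifier such as MultinomialNB.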

I suppose traindata_tfidf is actually in a sparse form. Try passing it in one of the scipy sparse formats instead of converting it to an array. Also take a look at the SparsePCA methods and, if that doesn't help, use MiniBatchSparsePCA.
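To show what keeping the data sparse can look like, here is a minimal sketch that uses `TruncatedSVD`. This is a swapped-in technique, not one named in the answer above; it is chosen because it accepts scipy sparse matrices directly, so no `.toarray()` call is needed. The random matrix standing in for `traindata_tfidf`, its sizes, and `n_components=50` are all assumptions for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for traindata_tfidf: 200 documents, 5000 features.
rng = np.random.RandomState(0)
traindata_tfidf = sparse.random(200, 5000, density=0.01, format="csr", random_state=rng)

# Fit on the sparse matrix directly; only the reduced output is dense.
svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(traindata_tfidf)

print(reduced.shape)  # (200, 50)
```

Note that the reduced features can be negative, and MultinomialNB rejects negative input, so a different classifier or a rescaling step may be needed after this kind of reduction.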
