SVD using Scikit-Learn and Gensim with 6 million features

I am trying to classify paragraphs by sentiment. I have training data of 600 thousand documents. When I convert them to a Tf-Idf vector space, with words as the analyzer and an ngram range of 1-2, there are almost 6 million features. So I have to do singular value decomposition (SVD) to reduce the number of features.

I have tried both gensim's and sklearn's SVD. Both work fine when reducing to 100 features, but as soon as I try 200 features they throw a memory error.

Also, I have not used the entire set of documents (600 thousand) as training data; I have taken only 50,000 documents. So essentially my training matrix is 50,000 × 6 million, and I want to reduce it to 50,000 × (100 to 500).

Is there any other way I can implement this in Python, or do I have to use Spark's MLlib SVD (written only for Java and Scala)? If yes, how much faster will it be?

System specification: 32 GB RAM and 4 processor cores on Ubuntu 14.04.

I don't really see why using Spark's MLlib SVD would improve performance or avoid memory errors. You are simply exceeding the size of your RAM. You have some options to deal with that:

  • Reduce the dictionary size of your tf-idf (for example, by adjusting the max_df and min_df parameters in scikit-learn).
  • Use a hashing vectorizer instead of tf-idf.
  • Get more RAM (but at some point tf-idf + SVD is not scalable).
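
As a rough illustration of the first two options, here is a minimal scikit-learn sketch. The toy corpus and every parameter value (min_df=2, max_df=0.9, max_features=50, n_features=2**10, n_components=3) are assumptions chosen so the example runs on six sentences; they are not tuned recommendations, and on a real corpus you would pick far larger limits (for instance a hash space of 2**18).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import (
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real training documents (hypothetical data).
docs = [
    "the movie was great and the acting was great",
    "the movie was terrible and the plot was boring",
    "great plot and great acting",
    "boring movie with terrible acting",
    "the plot was great but the acting was boring",
    "terrible movie and a boring plot",
]

# Option 1: shrink the tf-idf vocabulary.
# min_df drops n-grams seen in too few documents, max_df drops n-grams
# seen in too many, and max_features puts a hard cap on the vocabulary.
pruned = TfidfVectorizer(
    ngram_range=(1, 2), min_df=2, max_df=0.9, max_features=50
)
X_pruned = pruned.fit_transform(docs)

# Option 2: hash n-grams into a fixed number of columns, then reweight
# the counts with idf. The column count is set up front and does not
# grow with the corpus.
hashed = make_pipeline(
    HashingVectorizer(ngram_range=(1, 2), n_features=2**10, alternate_sign=False),
    TfidfTransformer(),
)
X_hashed = hashed.fit_transform(docs)

# With fewer columns, the dense arrays the SVD needs are much smaller.
svd = TruncatedSVD(n_components=3, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X_pruned)

print(X_pruned.shape, X_hashed.shape, X_reduced.shape)
```

Either way, the point is to cut the number of columns before the SVD, since the cost of the decomposition grows with the feature count.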

You should also show your code sample; you might be doing something wrong in your Python code.
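
For a rough sense of why the RAM limit is hit, the back-of-the-envelope calculation below estimates the size of a single dense float64 array of shape (n_components, 6 million), which is the shape of the component matrix a truncated SVD returns. The helper name dense_gib is hypothetical, and the figures are only a lower bound: how many working arrays of similar size a given solver allocates is an assumption that depends on the implementation.

```python
def dense_gib(rows, cols, bytes_per_value=8):
    """Size in GiB of a dense rows x cols array (float64 by default)."""
    return rows * cols * bytes_per_value / 2**30


n_features = 6_000_000  # feature count reported in the question

for n_components in (100, 200, 500):
    size = dense_gib(n_components, n_features)
    print(f"{n_components} components: about {size:.1f} GiB per dense array")
```

This gives roughly 4.5 GiB for 100 components, 8.9 GiB for 200, and 22.4 GiB for 500. If the solver keeps a few such arrays alive at once, 200 components can plausibly exhaust 32 GB, while 500 leaves almost no room even for a single copy.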
