
How can I reduce memory usage of Scikit-Learn Vectorizers?

Original 2013-07-08 21:36:35 9 2 python/ numpy/ machine-learning/ scipy/ scikit-learn

TfidfVectorizer takes a lot of memory: vectorizing 100k documents (470 MB) takes over 6 GB. If we go to 21 million documents, it will not fit in the 60 GB of RAM we have.

So we switched to HashingVectorizer, but we still need to know how to distribute it. Its fit and partial_fit methods do nothing, so how do we work with a huge corpus?

2 answers

I would strongly recommend using the HashingVectorizer when fitting models on large datasets.

The HashingVectorizer is data independent; only the parameters from vectorizer.get_params() are important. Hence (un)pickling a `HashingVectorizer` instance should be very fast.
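Because the vectorizer holds no fitted state, each chunk of the corpus can be transformed on its own, on any worker that has a copy of the same parameters. The following is a minimal sketch of that idea, not a full distributed setup: the toy corpus, the `iter_batches` helper, and the `n_features` value are assumptions made for illustration.

```python
import pickle

from sklearn.feature_extraction.text import HashingVectorizer

# n_features is an assumed value chosen for this sketch.
vectorizer = HashingVectorizer(n_features=2**18)


def iter_batches(docs, batch_size):
    """Yield successive lists of at most batch_size documents (hypothetical helper)."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


# Hypothetical toy corpus standing in for the real one.
corpus = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "hashing needs no vocabulary",
    "so each batch is independent",
]

# transform() works without a prior fit(): nothing is learned from the data,
# so only one batch needs to be in memory at a time.
batches = [vectorizer.transform(batch) for batch in iter_batches(corpus, 2)]

# A pickled copy carries only the parameters, so a worker that unpickles it
# produces exactly the same features for the same documents.
clone = pickle.loads(pickle.dumps(vectorizer))
same = abs(clone.transform(corpus[:2]) - batches[0]).sum() == 0
```

Each element of `batches` is a SciPy sparse matrix with `n_features` columns, so results computed by different workers can be stacked with `scipy.sparse.vstack` or fed one at a time to an estimator that supports incremental learning.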

The vocabulary-based vectorizers are better suited for exploratory analysis on small datasets.

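One way to overcome HashingVectorizer's inability to account for IDF is to index the data into Elasticsearch or Lucene and retrieve term vectors from there; TF-IDF can then be computed from them.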
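As a rough illustration of the last step, the sketch below computes TF-IDF weights once per-document term frequencies and corpus-wide document frequencies are available. It does not call Elasticsearch or Lucene: the input dictionaries are hypothetical stand-ins for whatever the index returns, and the smoothed IDF formula is the one scikit-learn's TfidfTransformer uses by default, which is an assumption about which variant is wanted.

```python
import math


def tfidf_from_counts(term_freqs, doc_freqs, n_docs):
    """Return {term: tf * idf} for a single document.

    term_freqs: {term: count of the term in this document}
    doc_freqs:  {term: number of documents containing the term}
    n_docs:     total number of documents in the index
    """
    weights = {}
    for term, tf in term_freqs.items():
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        idf = math.log((1 + n_docs) / (1 + doc_freqs[term])) + 1.0
        weights[term] = tf * idf
    return weights


# Hypothetical counts, as they might be read from retrieved term vectors.
weights = tfidf_from_counts(
    term_freqs={"memory": 3, "the": 3},
    doc_freqs={"memory": 50, "the": 1000},
    n_docs=1000,
)
```

With these made-up counts, "the" appears in every document and keeps a weight equal to its raw term frequency, while the rarer "memory" receives a larger weight.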

Custom tokenizer for scikit-learn vectorizers

Using scikit-learn vectorizers and vocabularies with gensim

scikit-learn DBSCAN memory usage

How can i distribute processing of minibatch kmeans (scikit-learn)?

How can I make Scikit-learn TfidfVectorizer not to preprocess the text?

How can I create a scikit-learn tree by hand?

scikit-learn Random Forest excessive memory usage

How can I run python scikit-learn on Raspberry Pi?

Object has no attribute in scikit-learn, how can I access it?

How can I classify big text data with scikit-learn?




solution1
9 ACCEPTED 2013-07-08 21:57:29

solution2
0 2015-02-10 04:49:37
