
How can I classify big text data with scikit-learn?

I have a large database, 50 GB in size, consisting of excerpts from 486,000 dissertations in 780 specialties. For research purposes, I need to train a classifier on this data. Unfortunately, my resources are limited to a mobile processor and 16 GB of memory (plus 16 GB of swap).

I ran the analysis on a set of 40,000 items (10% of the base, 4.5 GB) with SGDClassifier, and memory consumption was around 16-17 GB.

Therefore, I ask the community for help on this.

Currently my code looks like this:

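from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# texts: list of document strings; categories_ids: the matching specialty labels
text_clf = Pipeline([
    ('count', CountVectorizer()),    # raw term counts
    ('tfidf', TfidfTransformer()),   # TF-IDF weighting
    ('clf', SGDClassifier(n_jobs=8)),
])
texts_train, texts_test, cat_train, cat_test = train_test_split(texts, categories_ids, test_size=0.2)
text_clf.fit(texts_train, cat_train)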

Therefore, I ask for advice on how to optimize this process so that I can process the entire database.

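You can call .partial_fit() instead of .fit(). Each call continues training from the current weights, and the first call must be given the full list of labels through the classes argument. The related option warm_start=True applies to repeated .fit() calls, where it reuses the previous solution as the starting point; partial_fit does not need it.

See the SGDClassifier documentation, which describes that argument and that method.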

Basically, you would load only a portion of the data at a time, run it through your pipeline and call partial_fit in a loop. This would keep the memory requirements down while also allowing you to train on all the data, regardless of the amount.
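The loop described above could be sketched as follows. This is only an illustration under stated assumptions: iter_batches is a hypothetical loader with toy data standing in for reading the database in chunks, all_classes would be the 780 specialty ids in the real case, and HashingVectorizer is swapped in for CountVectorizer because it is stateless and so needs no fitting pass over the whole corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches():
    """Hypothetical loader: yields (texts, labels) one chunk at a time."""
    yield ["deep learning for image recognition",
           "protein folding in yeast cells"], [0, 1]
    yield ["neural network image classifier",
           "yeast gene expression analysis"], [0, 1]


# Stateless vectorizer (an assumption, not the question's CountVectorizer):
# it hashes tokens into a fixed number of columns, so no vocabulary is stored.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(random_state=0)

# partial_fit must be told every possible label on its first call.
all_classes = np.array([0, 1])

for texts, labels in iter_batches():
    X = vectorizer.transform(texts)  # only this chunk is held in memory
    clf.partial_fit(X, labels, classes=all_classes)
```

Only one chunk is vectorized at a time, so peak memory depends on the chunk size rather than on the size of the database.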

EDIT

As noted in the comments, the loop described above only works for the predictive model, because CountVectorizer and TfidfTransformer have no partial_fit. The data pre-processing therefore needs to happen separately.

Here is a solution for training the CountVectorizer iteratively.

This question contains a TF-IDF implementation that doesn't require all of the data to be loaded into memory.

So the final solution would be to preprocess the data in two stages: the first for the CountVectorizer and the second for the TF-IDF weighting.

Then to train the model you follow the same process as originally proposed, except without a Pipeline because that is no longer needed.
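A minimal sketch of that two-stage approach, not the linked implementations themselves, might look like the following. The assumptions: iter_batches is a hypothetical loader with toy data, the corpus can be streamed twice, and the IDF weights are computed by hand with the same smoothed formula that TfidfTransformer uses by default.

```python
from collections import Counter

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import normalize


def iter_batches():
    """Hypothetical loader: yields (texts, labels) one chunk at a time."""
    yield ["deep learning for image recognition",
           "protein folding in yeast cells"], [0, 1]
    yield ["neural network image classifier",
           "yeast gene expression analysis"], [0, 1]


# Stage 1: stream the corpus once and count, per term, how many documents use it.
analyzer = CountVectorizer().build_analyzer()
doc_freq = Counter()
n_docs = 0
for texts, _ in iter_batches():
    for text in texts:
        doc_freq.update(set(analyzer(text)))
        n_docs += 1

# Fixed vocabulary plus smoothed IDF: ln((1 + n) / (1 + df)) + 1.
terms = sorted(doc_freq)
vocabulary = {term: i for i, term in enumerate(terms)}
df = np.array([doc_freq[t] for t in terms], dtype=np.float64)
idf = np.log((1 + n_docs) / (1 + df)) + 1
idf_diag = sparse.diags(idf)

# Stage 2: stream again, vectorize each chunk with the fixed vocabulary,
# apply the IDF weights and L2 normalization, then train incrementally.
vectorizer = CountVectorizer(vocabulary=vocabulary)
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # every label must be known before training
for texts, labels in iter_batches():
    counts = vectorizer.transform(texts)
    X = normalize(counts @ idf_diag, norm="l2")
    clf.partial_fit(X, labels, classes=all_classes)
```

The vocabulary and the document-frequency counts are the only global state, so memory use stays bounded by the chunk size plus the vocabulary size. If the vocabulary itself grows too large, rare terms could be dropped after stage 1 by thresholding doc_freq.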
