
How to efficiently serialize a scikit-learn classifier

What's the most efficient way to serialize a scikit-learn classifier?

I'm currently using Python's standard pickle module to serialize a text classifier, but this results in a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier is usually just a couple of MBs.

Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
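
For reference, a minimal sketch of how such a pipeline would be pickled with the standard library (the file name and the fit call are placeholders, not from the question):

import pickle

# The pipeline is assumed to have been fitted beforehand, e.g.:
# classifier.fit(train_texts, train_labels)

# Standard-library pickling; with (1, 4)-grams the CountVectorizer's
# vocabulary_ dict alone can account for most of the file size.
with open('classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)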

For large text datasets, use the hashing trick: replace the TfidfVectorizer with a HashingVectorizer (potentially stacked with a TfidfTransformer in the pipeline). It will be much faster to pickle, since you no longer have to store the vocabulary dict, as discussed recently in this question:
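
A sketch of the question's pipeline rewritten with the hashing trick (only ngram_range is carried over; the other parameters are left at their defaults, which is an assumption rather than part of the answer):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# HashingVectorizer is stateless: it hashes tokens into a fixed number
# of columns instead of building a vocabulary_ dict, so the fitted
# pipeline pickles to a small file regardless of corpus size.
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])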

How can I reduce memory usage of Scikit-Learn Vectorizers?

You can also use joblib.dump and pass in a compression level. I noticed my classifier pickle dumps shrinking by a factor of ~16 using the option compress=3.
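
For example (the file name is a placeholder; compress=3 is the level mentioned above):

import joblib

# Compressed dump; higher compress levels trade CPU time for file size.
joblib.dump(classifier, 'classifier.joblib', compress=3)

# Loading works the same way whether or not the dump was compressed.
classifier = joblib.load('classifier.joblib')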
