
How to efficiently serialize a scikit-learn classifier

What's the most efficient way to serialize a scikit-learn classifier?

I'm currently using Python's standard pickle module to serialize a text classifier, but this results in a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier is usually just a couple of MBs.

Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
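
For reference, a minimal sketch of how such a pipeline would be pickled with the standard library (the file name and the fit call are placeholders, not from the question):

import pickle

# The pipeline is assumed to have been fitted beforehand, e.g.:
# classifier.fit(train_texts, train_labels)

# Standard-library pickling; with (1, 4)-grams the CountVectorizer's
# vocabulary_ dict alone can account for most of the file size.
with open('classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)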

For large text datasets, use the hashing trick: replace the TfidfVectorizer with a HashingVectorizer (potentially stacked with a TfidfTransformer in the pipeline). It will be much faster to pickle, since you no longer have to store the vocabulary dict, as discussed recently in this question:
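
A sketch of the question's pipeline rewritten with the hashing trick (only ngram_range is carried over; the other parameters are left at their defaults, which is an assumption rather than part of the answer):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# HashingVectorizer is stateless: it hashes tokens into a fixed number
# of columns instead of building a vocabulary_ dict, so the fitted
# pipeline pickles to a small file regardless of corpus size.
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])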

How can I reduce memory usage of Scikit-Learn Vectorizers?

You can also use joblib.dump and pass in a compression level. I noticed my classifier pickle dumps shrinking by a factor of ~16 using the option compress=3.
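
For example (the file name is a placeholder; compress=3 is the level mentioned above):

import joblib

# Compressed dump; higher compress levels trade CPU time for file size.
joblib.dump(classifier, 'classifier.joblib', compress=3)

# Loading works the same way whether or not the dump was compressed.
classifier = joblib.load('classifier.joblib')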
