How to efficiently serialize a scikit-learn classifier
What's the most efficient way to serialize a scikit-learn classifier?
I'm currently using Python's standard pickle module to serialize a text classifier, but this results in a monstrously large pickle. The serialized object can be 100 MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier is usually just a couple of MBs.
Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up serialization and reduce the size of serialized scikit-learn classifiers?
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
For large text datasets, use the hashing trick: replace the TfidfVectorizer with a HashingVectorizer (potentially stacked with a TfidfTransformer in the pipeline). Pickling will be much faster because you no longer have to store the vocabulary dict, as discussed recently in this question:
How can I reduce memory usage of Scikit-Learn Vectorizers?
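A minimal sketch of that variant of the question's pipeline, with the stateful CountVectorizer swapped for a stateless HashingVectorizer (the parameter choices here are illustrative, not prescriptive):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# HashingVectorizer keeps no vocabulary_ dict, so the fitted pipeline
# pickles much faster and smaller than one built on CountVectorizer.
# alternate_sign=False keeps the hashed counts non-negative so the
# downstream tf-idf weighting stays meaningful.
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), alternate_sign=False)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
```

The trade-off is that hashing is one-way: you lose the ability to map feature indices back to n-grams, and distinct n-grams can collide in the same bucket.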
You can also use joblib.dump and pass in a compression level. I noticed my classifier pickle dumps shrinking by a factor of ~16 with compress=3.