Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data?

I have the following sample program from the scikit-learn website:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

import pickle  # on Python 2 this was cPickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)
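
The loaded object behaves just like the original, so (continuing the example above) you can use it to predict right away:

# the restored classifier predicts exactly as the original did
y_pred = gnb_loaded.predict(iris.data)
print("Number of mislabeled points : %d" % (iris.target != y_pred).sum())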

Edit: if you are using a sklearn Pipeline that contains custom transformers which cannot be serialized by pickle (nor by joblib), then Neuraxle's custom ML Pipeline saving is a solution: it lets you define your own custom step savers on a per-step basis. Upon saving, the savers are called for each step that defines one; otherwise joblib is used as the default for steps without a saver.

You can also use joblib.dump and joblib.load, which are much more efficient at handling numerical arrays than the default Python pickler.

Joblib is included in scikit-learn:

>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier

>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target)  # evaluate training error
0.9526989426822482

>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)

>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
       n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
       shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482

Edit: in Python 3.8+ it is now possible to use pickle for efficient pickling of objects with large numerical arrays as attributes, if you use pickle protocol 5 (which is not the default).
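
A minimal sketch, reusing the fitted clf from above; the file path is illustrative:

import pickle

# protocol 5 (PEP 574, Python 3.8+) handles large binary buffers such as
# numpy arrays much more efficiently than the default protocol
with open('/tmp/digits_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f, protocol=5)

with open('/tmp/digits_classifier.pkl', 'rb') as f:
    clf_restored = pickle.load(f)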

What you are looking for is called model persistence in sklearn terms, and it is documented in the introduction and in the model persistence sections.

So you have initialized your classifier and trained it for a long time with

clf = some.classifier()
clf.fit(X, y)

After this you have two options:

1) Using Pickle

import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)

# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)

2) Using Joblib

import joblib  # 'from sklearn.externals import joblib' is deprecated since sklearn 0.21
# now you can save it to a file
joblib.dump(clf, 'filename.pkl') 
# and later you can load it
clf = joblib.load('filename.pkl')

Once more, it is helpful to read the above-mentioned links.

In many cases, particularly with text classification, it is not enough to store just the classifier: you will need to store the vectorizer as well, so that you can vectorize your input in the future.

import pickle
with open('model.pkl', 'wb') as fout:
    pickle.dump((vectorizer, clf), fout)

Future use case:

with open('model.pkl', 'rb') as fin:
    vectorizer, clf = pickle.load(fin)

X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)

Before dumping the vectorizer, one can delete its stop_words_ attribute by:

vectorizer.stop_words_ = None

to make dumping more efficient. Also, if your classifier's parameters are sparse (as in most text classification examples), you can convert the parameters from dense to sparse, which will make a huge difference in terms of memory consumption, loading, and dumping. Sparsify the model by:

clf.sparsify()

This works automatically for SGDClassifier, but if you know your model is sparse (lots of zeros in clf.coef_), you can manually convert clf.coef_ into a scipy CSR sparse matrix by:

import scipy.sparse

clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

and then you can store it more efficiently.

sklearn estimators implement methods that make it easy for you to save the relevant trained attributes of an estimator. Some estimators implement __getstate__ methods themselves, but others, like GMM, just use the base implementation, which simply saves the object's inner dictionary:

def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state

The recommended method to save your model to disk is to use the pickle module:

import pickle
from sklearn import datasets
from sklearn.svm import SVC

# train a small SVC on the first two features of two iris classes
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X, y)

# persist the fitted model
with open('mymodel', 'wb') as f:
    pickle.dump(model, f)

However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).

From the documentation:

In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along with the pickled model (see the sketch after the list below):

The training data, e.g. a reference to an immutable snapshot

The Python source code used to generate the model

The versions of scikit-learn and its dependencies

The cross-validation score obtained on the training data
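
A minimal sketch of bundling such metadata with the pickled model, reusing model, X, and y from the example above; the file name, snapshot reference, and cv=5 setting are illustrative:

import sys
import pickle
import sklearn
from sklearn.model_selection import cross_val_score

bundle = {
    'model': model,
    'sklearn_version': sklearn.__version__,   # scikit-learn version used for training
    'python_version': sys.version,            # interpreter the model was built with
    'cv_score': cross_val_score(model, X, y, cv=5).mean(),  # CV score on the training data
    'training_data': 'iris-snapshot-v1',      # hypothetical reference to an immutable snapshot
}

with open('mymodel_with_metadata.pkl', 'wb') as f:
    pickle.dump(bundle, f)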

This is especially true for ensemble estimators that rely on the tree.pyx module written in Cython (such as IsolationForest), since that creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards-incompatible changes in the past.

If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:

In the specific case of scikit-learn, it may be more interesting to use joblib's replacement of pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators, but it can only pickle to the disk and not to a string:

sklearn.externals.joblib has been deprecated since 0.21 and will be removed in v0.23:

/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)


Therefore, you need to install joblib:

pip install joblib

and finally write the model to disk:

import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier


digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)

with open('myClassifier.joblib.pkl', 'wb') as f:
    joblib.dump(clf, f, compress=9)

Now, in order to read the dumped file, all you need to run is:

with open('myClassifier.joblib.pkl', 'rb') as f:
    my_clf = joblib.load(f)
