
How do I avoid re-training machine learning models

Self-learner here.

I am building a web application that predicts events.

Let's consider this quick example.

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

print(neigh.predict([[1.1]]))

How can I keep the state of neigh so that when I enter a new value like neigh.predict([[1.2]]) I don't need to re-train the model? Is there any good practice, or a hint, to start solving the problem?

You've chosen a slightly confusing example for a couple of reasons. First, when you say neigh.predict([[1.2]]), you aren't adding a new training point, you're just doing a new prediction, so that doesn't require any changes at all. Second, KNN algorithms aren't really "trained" -- KNN is an instance-based algorithm, which means that "training" amounts to storing the training data in a suitable structure. As a result, this question has two different answers. I'll try to answer the KNN question first.

K Nearest Neighbors

For KNN, adding new training data amounts to appending new data points to the structure. However, it appears that scikit-learn doesn't provide any such functionality. (That's reasonable enough -- since KNN explicitly stores every training point, you can't just keep giving it new training points indefinitely.)

If you aren't using many training points, a simple list might be good enough for your needs! In that case, you could skip sklearn altogether, and just append new data points to your list. To make a prediction, do a linear search, saving the k nearest neighbors, and then make a prediction based on a simple "majority vote" -- if out of five neighbors, three or more are red, then return red, and so on. But keep in mind that every training point you add will slow the algorithm.
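The list-based approach can be sketched in plain Python; knn_predict here is a hypothetical helper written for illustration, not part of any library:

```python
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Linear search: sort stored 1-D points by distance to the query,
    then majority-vote among the labels of the k nearest."""
    order = sorted(range(len(points)), key=lambda i: abs(points[i] - query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# "Training" is just appending to the lists -- no re-fitting step.
X, y = [0, 1, 2, 3], [0, 0, 1, 1]
print(knn_predict(X, y, 1.1))  # -> 0
X.append(2.5)
y.append(1)                    # a new training point, added in O(1)
```

Every prediction scans the whole list, which is exactly why this only scales to a modest number of points.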

If you need to use many training points, you'll want to use a more efficient structure for nearest neighbor search, like a KD Tree. There's a scipy KD Tree implementation that ought to work. The query method allows you to find the k nearest neighbors. It will be more efficient than a list, but it will still get slower as you add more training data.
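A rough sketch with scipy, assuming it is installed; note that scipy's tree is immutable, so adding points means rebuilding it from the full data:

```python
import numpy as np
from collections import Counter
from scipy.spatial import KDTree

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 0, 1, 1])

tree = KDTree(X)                      # build once from the current data
dist, idx = tree.query([[1.1]], k=3)  # indices of the 3 nearest neighbors
pred = Counter(y[idx[0]]).most_common(1)[0][0]  # majority vote
print(pred)  # -> 0
```

Queries are O(log n) on average instead of O(n), which is the whole point of the tree.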

Online Learning

A more general answer to your question is that you are (unbeknownst to yourself) trying to do something called online learning. Online learning algorithms allow you to use individual training points as they arrive, and discard them once they've been used. For this to make sense, you need to store not the training points themselves (as in KNN) but a set of parameters, which you optimize.

This means that some algorithms are better suited to this than others. sklearn provides just a few algorithms capable of online learning. These all have a partial_fit method that will allow you to pass training data in batches. SGDClassifier with 'hinge' or 'log' loss is probably a good starting point.
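A minimal sketch of partial_fit on the toy data (the first call must declare every class the model will ever see, since later batches may not contain all of them):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)  # default 'hinge' loss
X = np.array([[0], [1], [2], [3]])
y = np.array([0, 0, 1, 1])

# First call: pass the full set of class labels up front.
clf.partial_fit(X, y, classes=np.array([0, 1]))

# Later, update with new points as they arrive -- no full re-training.
clf.partial_fit(np.array([[2.5]]), np.array([1]))
print(clf.predict([[1.1]]))
```

Each partial_fit call takes a gradient step on just the batch it is given, so old batches can be discarded after use.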

Or maybe you just want to save your model after fitting

import joblib

joblib.dump(neigh, FName)  # FName is a file path, e.g. 'model.joblib'

and load it when needed

neigh = joblib.load(FName)
neigh.predict([[1.1]])
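Put together, a runnable round-trip might look like this (the temporary-file path is purely illustrative; a web app would use a fixed path and load once at startup):

```python
import os
import tempfile

import joblib
from sklearn.neighbors import KNeighborsClassifier

# Fit once, offline.
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

path = os.path.join(tempfile.mkdtemp(), "knn.joblib")
joblib.dump(neigh, path)          # persist the fitted model

restored = joblib.load(path)      # e.g. when the web app starts
print(restored.predict([[1.1]]))  # -> [0]
```

Loading a pickled model is fast compared to re-fitting, which is exactly what you want for serving predictions.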

