
Retraining an existing machine learning model with new data

I have an ML model trained on about a million samples (supervised text classification), and I want the same model to be trained again as soon as new training data comes in.

This process is continuous, and I do not want to lose the model's predictive power every time it receives a new data set. Nor do I want to merge the new data with my historical data (~1 million samples) and train from scratch.

Ideally, the model would grow gradually, training on all the data over time while preserving what it has already learned each time it receives a new training set. What is the best way to avoid retraining on all of the historical data? A code sample would help.

You want to take a look at incremental learning techniques for that. Many scikit-learn estimators offer a partial_fit method, which means you can train incrementally on small batches of data.

A common approach in these cases is to use SGDClassifier (or SGDRegressor), which updates the model's parameters from a fraction of the samples on each iteration, making it a natural candidate for online learning problems. However, you must update the model through the partial_fit method; calling fit would retrain the whole model from scratch.

From the documentation:

SGD allows minibatch (online/out-of-core) learning, see the partial_fit method

As mentioned, several other estimators in scikit-learn implement the partial_fit API, as listed in the incremental learning section of the documentation, including MultinomialNB, linear_model.Perceptron, and MiniBatchKMeans, among others.


Here's a toy example to illustrate how it's used:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)

clf = SGDClassifier()

kf = KFold(n_splits=2)
kf_splits = kf.split(X)

train_index, test_index = next(kf_splits)
# partial_fit with the training data. On the first call,
# all possible classes must be provided
clf.partial_fit(X[train_index], y[train_index], classes=np.unique(y))

# re-training on new data
train_index, test_index = next(kf_splits)
clf.partial_fit(X[train_index], y[train_index])
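
Since the question is about text classification, the same pattern extends to learning from text batches out of core. Here is a minimal sketch (the toy batches and labels are made up for illustration) that pairs a stateless HashingVectorizer, which never needs to be fitted on the full corpus, with MultinomialNB via partial_fit:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: each new batch of text can be transformed
# independently. alternate_sign=False keeps the features non-negative,
# which MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

clf = MultinomialNB()
all_classes = np.array([0, 1])  # every possible label, known up front

# initial batch of labelled text (made-up toy data)
texts = ["good product", "terrible service"]
labels = np.array([1, 0])
clf.partial_fit(vectorizer.transform(texts), labels, classes=all_classes)

# later, when a new batch arrives, update without touching old data
new_texts = ["excellent support"]
new_labels = np.array([1])
clf.partial_fit(vectorizer.transform(new_texts), new_labels)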

What you are looking for is incremental learning; there is an excellent library called creme that helps you with that.

All the tools in the library can be updated with a single observation at a time, and can therefore be used to learn from streaming data.

Here are some benefits of using creme (and online machine learning in general):

Incremental: models can update themselves in real time.
Adaptive: models can adapt to concept drift.
Production-ready: working with data streams makes it simple to replicate production scenarios during model development.
Efficient: models don't have to be retrained and require little compute power, which lowers their carbon footprint.
Fast: when the goal is to learn and predict with a single instance at a time, creme is an order of magnitude faster than PyTorch, TensorFlow, and scikit-learn.

Check out this: https://pypi.org/project/creme/
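
To make that concrete, here is a minimal sketch of creme's single-observation fit_one/predict_one API; the tiny stream of feature dicts and labels is invented for illustration:

from creme import compose, linear_model, metrics, preprocessing

# a made-up stream of (feature dict, label) pairs; in practice this
# would be whatever source yields new training data over time
stream = [
    ({'length': 10.0, 'n_exclamations': 2.0}, True),
    ({'length': 25.0, 'n_exclamations': 0.0}, False),
    ({'length': 8.0, 'n_exclamations': 3.0}, True),
]

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(),
)

metric = metrics.Accuracy()

for x, y in stream:
    y_pred = model.predict_one(x)  # predict before learning (progressive validation)
    metric = metric.update(y, y_pred)
    model = model.fit_one(x, y)    # update the model with a single observation

print(metric)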
