
partial_fit with SGDClassifier gives fluctuating accuracy

I have my data in a sparse matrix. I am currently working on a subset of ~500k rows before starting the full computation. The features are bigram counts plus entropy and string length, and the complete dataset contains hundreds of millions of rows by 1400 columns. The model is meant to help characterise these strings, so I use SGDClassifier for logistic regression.

Because of the large size I decided to use partial_fit on my SGDClassifier, but the area-under-curve value I calculate at each epoch fluctuates a lot.

Here is my code:

import dill
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

model = SGDClassifier(loss='log', alpha=1e-10, n_iter=50, n_jobs=-1, shuffle=True)
best_auc = 0.0  # track the best holdout AUC seen so far

for f in file_list:  # file_list, labels and max_epoch are defined elsewhere
    data = dill.load(open(f, 'rb'))
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
    # carve the holdout out of the training split so it stays disjoint from the test set
    X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.05)
    for ep in range(max_epoch):
        model.partial_fit(X_train, y_train, classes=np.unique(y_train))

        # Calculate area under the ROC curve to see if things improve
        probs = model.predict_proba(X_holdout)
        auc   = roc_auc_score(y_holdout, probs[:, 1])

        if auc > best_auc: best_auc = auc
        print('Epoch: %d - auc: %.2f (best %.2f)' % (ep, auc, best_auc))

What happens is that the AUC quickly goes up to ~0.9 but then fluctuates a lot; sometimes it even drops to ~0.5-0.6 and then climbs back up. I would have expected the AUC to generally increase with each epoch, with only small dips possible, until it settles at an equilibrium value where further training hardly improves anything.

Is there anything I am doing wrong, or is this possibly "normal" behaviour with partial_fit? I never saw this behaviour when I used fit on the smaller dataset.

In practice, partial_fit has been observed to be prone to drops or fluctuations in accuracy. To some extent this can be mitigated by shuffling and feeding only small fractions of the entire dataset at a time. For larger data, however, online training with SGDClassifier/SVM classifiers often seems to give only decreasing accuracy.
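For illustration, here is a minimal sketch of that mitigation. It uses synthetic data as a stand-in for one loaded file (the real data would be the sparse bigram-count matrices), and batch_size and the number of passes are placeholder values to tune:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.utils import shuffle

# Synthetic stand-in for one loaded chunk; the real data is a sparse
# bigram-count matrix with ~1400 columns.
X, y = make_classification(n_samples=10_000, n_features=1400, random_state=0)

model = SGDClassifier(loss='log_loss', alpha=1e-4)  # 'log' in older scikit-learn
classes = np.unique(y)       # the full label set, fixed before the first call
batch_size = 1_000           # placeholder; tune for your data

for epoch in range(5):
    X, y = shuffle(X, y, random_state=epoch)        # reshuffle before every pass
    for start in range(0, X.shape[0], batch_size):
        batch = slice(start, start + batch_size)
        model.partial_fit(X[batch], y[batch], classes=classes)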

I experimented with this and found that a low learning rate can sometimes help. The rough analogy is that repeatedly training the same model on large amounts of data leads to the model forgetting what it learnt from the previous data, so a tiny learning rate slows down the rate of forgetting as well as the rate of learning.
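As a hedged sketch of that idea, pinning the rate to a small constant looks like this (the eta0 value here is an arbitrary small placeholder to tune, not a recommendation):

from sklearn.linear_model import SGDClassifier

# Keep the step size fixed and tiny so each new chunk only nudges the weights,
# which slows both learning and forgetting.
model = SGDClassifier(
    loss='log_loss',            # logistic regression ('log' in older releases)
    learning_rate='constant',   # turn off the default 'optimal' schedule
    eta0=1e-4,                  # small fixed step; placeholder value to tune
)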

Rather than providing a rate manually, we can use the adaptive learning rate functionality provided by sklearn. Notice the model initialisation part:

model = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001, max_iter=3000,
                      tol=None, shuffle=True, verbose=0,
                      learning_rate='adaptive', eta0=0.01, early_stopping=False)

This is described in the scikit-learn documentation as:

'adaptive': eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
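To make that rule concrete, here is a hedged re-implementation of the schedule described in the quote (illustrative only, not scikit-learn's internal code; the losses, tol and n_iter_no_change values are example inputs):

# Start at eta0; after n_iter_no_change consecutive epochs without improving
# the best training loss by at least tol, divide the learning rate by 5.
def adaptive_eta(losses, eta0=0.01, tol=1e-3, n_iter_no_change=5):
    eta, best_loss, bad_epochs = eta0, float('inf'), 0
    schedule = []
    for loss in losses:
        if loss > best_loss - tol:   # not enough improvement this epoch
            bad_epochs += 1
        else:
            bad_epochs = 0
        best_loss = min(best_loss, loss)
        if bad_epochs >= n_iter_no_change:
            eta /= 5.0               # the division rule quoted above
            bad_epochs = 0
        schedule.append(eta)
    return schedule

# Example: the rate drops once the loss plateaus
print(adaptive_eta([0.9, 0.5, 0.45, 0.45, 0.45, 0.45], n_iter_no_change=3))
# -> [0.01, 0.01, 0.01, 0.01, 0.01, 0.002]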

This change in learning rate gave me really good results: accuracy, which had initially dropped from 98% to 28% on the fourth part of the dataset, came back up to 100%.
