简体   繁体   中英

How to graph Learning Curve of Logistic Regression?

I am running a Logistic Regression and would like to plot the Learning Curve of this to get a feel for the data. How can I do this ? Here is my code thus far :

 from sklearn import metrics,preprocessing,cross_validation
  from sklearn.feature_extraction.text import TfidfVectorizer
  import sklearn.linear_model as lm
  import pandas as p
  loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

  print "loading data.."
  traindata = list(np.array(p.read_table('train.tsv'))[:,2])
  testdata = list(np.array(p.read_table('test.tsv'))[:,2])
  y = np.array(p.read_table('train.tsv'))[:,-1]

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1)

  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None)

  X_all = traindata + testdata
  lentrain = len(traindata)

  print "fitting pipeline"
  tfv.fit(X_all)
  print "transforming data"
  X_all = tfv.transform(X_all)

  X = X_all[:lentrain]
  X_test = X_all[lentrain:]

  print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

  print "training on full data"
  rd.fit(X,y)
  pred = rd.predict_proba(X_test)[:,1]
  testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
  pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
  pred_df.to_csv('benchmark.csv')
  print "submission file created.."

What I would like to create is something like this, so I can have a better understanding of what is going on :

预期输出的图像

Can anyone help me with this please?

Not quite as general as it should be, but it'll do the job with a little fiddling on your end.

from matplotlib import pyplot as plt
from sklearn import metrics
import numpy as np

def data_size_response(model,trX,teX,trY,teY,score_func,prob=True,n_subsets=20):

    train_errs,test_errs = [],[]
    subset_sizes = np.exp(np.linspace(3,np.log(trX.shape[0]),n_subsets)).astype(int)

    for m in subset_sizes:
        model.fit(trX[:m],trY[:m])
        if prob:
            train_err = score_func(trY[:m],model.predict_proba(trX[:m]))
            test_err = score_func(teY,model.predict_proba(teX))
        else:
            train_err = score_func(trY[:m],model.predict(trX[:m]))
            test_err = score_func(teY,model.predict(teX))
        print "training error: %.3f test error: %.3f subset size: %.3f" % (train_err,test_err,m)
        train_errs.append(train_err)
        test_errs.append(test_err)

    return subset_sizes,train_errs,test_errs

def plot_response(subset_sizes,train_errs,test_errs):

    plt.plot(subset_sizes,train_errs,lw=2)
    plt.plot(subset_sizes,test_errs,lw=2)
    plt.legend(['Training Error','Test Error'])
    plt.xscale('log')
    plt.xlabel('Dataset size')
    plt.ylabel('Error')
    plt.title('Model response to dataset size')
    plt.show()

model = # put your model here
score_func = # put your scoring function here
response = data_size_response(model,trX,teX,trY,teY,score_func,prob=True)
plot_response(*response)

The data_size_response function takes a model (in your case a instantiated LR model), a pre-split dataset (train/test X and Y arrays you can use the train_test_split function in sklearn to generate this), and a scoring function as input and iterates through your dataset training on n exponentially spaced subsets and returns the "learning curve". There is also a plotting function for visualizing this response.

I would have liked to use cross_val_score like your example but it would require modifying sklearn source to get back training scores in addition to the test scores it already provides. The prob argument is whether or not to use a predict_proba vs predict method on the model which is necessary for certain model/scoring function combinations eg roc_auc_score.

Example plot on a subset of the MNIST dataset: 在此处输入图片说明

Let me know if you have any questions!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM