简体   繁体   中英

scikit-learn logistic regression feature importance

I'm looking for a way to get an idea of the impact of the features I'm using in a classification problem. Using sklearn's logistic regression classifier ( http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html ), I understood that the .coef_ attribute gets me the information I'm after (as also discussed in this thread: How to find the importance of the features for a logistic regression model? ).

The first few lines of my matrix:

phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True

Where the first line is the header, followed by the data (using the preprocessor's LabelEncoder in my code to convert this to ints).

Now, when I do a

print(classifier.coef_)

I get

[[ 0.84768459 -0.56344453  0.00365928  0.21441586 -1.70290447 -0.18460676
   1.6167634   0.08556331  0.02152226 -0.05111953  0.07310608 -0.073653  ]]

which contains 12 columns/elements. I'm confused by this, since my data contains 13 columns (plus the 14th one with the label, I'm separating the features from the labels later on in my code). I was wondering if maybe sklearn expects/assumes the first column to be the id and doesn't actually use the value of this column? But I cannot find any info on this.

Any help here would be much appreciated!

Not sure how to edit my original question in a way that it would still make sense for future reference, so I'll post a minimal example here:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy

headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','no,pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]


df = pandas.DataFrame(matrix, columns=headers)
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
df = df.apply(lambda x: d[x.name].transform(x))

testrows = []
trainrows = []
splitIndex = len(matrix)/10
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)

classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)

I think I may have found the source of the error (thanks @Alexey Trofimov for pointing me in the right direction). My code at first contained:

train_features = traindf.iloc[:,1:len(headers)-1]

Which was copied from another script, where I did have id's as the first column in my matrix, hence didn't want to take these into account. The len(headers)-1 then, if I understand things correctly, is to not take into account the actual label. Testing this in a real world scenario, deleting the -1 results in perfect f-score, which would make sense, since it would just only look at the actual label and always predict correctly... So I now changed this to

train_features = traindf.iloc[:,0:len(headers)-1]

as in the code snippet, and now get 13 columns (in X_train.shape, and consequently in classifier.coef_). I think this solved my issue, but am still not 100% convinced, so if someone could point out an error in this line of reasoning/my code above, I'd be grateful to hear about it.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM