
scikit-learn logistic regression feature importance

I'm looking for a way to gauge the impact of the features I'm using in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understand that the .coef_ attribute gives me the information I'm after (as also discussed in this thread: How to find the importance of the features for a logistic regression model?).

The first few lines of my matrix:


The first line is the header, followed by the data (my code uses LabelEncoder from sklearn.preprocessing to convert the values to ints).

Now, when I print classifier.coef_ after fitting, I get

[[ 0.84768459 -0.56344453  0.00365928  0.21441586 -1.70290447 -0.18460676
   1.6167634   0.08556331  0.02152226 -0.05111953  0.07310608 -0.073653  ]]

which contains 12 columns/elements. I'm confused by this, since my data contains 13 feature columns (plus a 14th with the label; I separate the features from the labels later in my code). I was wondering whether sklearn expects/assumes the first column to be an ID and doesn't actually use the values in it, but I cannot find any info on this.
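For reference, a minimal sketch of how the length of coef_ relates to the input. The data is synthetic and the feature names are hypothetical placeholders, not the matrix from this question. As far as I know, LogisticRegression does not treat any column as an ID: it produces one coefficient for every column passed to fit(), so 12 coefficients would mean only 12 columns reached the classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the data from the question):
# 100 rows, 13 integer-coded feature columns, binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(100, 13))
y = (X[:, 0] + X[:, 6] > 3).astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# For a binary problem coef_ has shape (1, n_features):
# one coefficient per column that was passed to fit().
print(X.shape)          # (100, 13)
print(clf.coef_.shape)  # (1, 13)

# Dropping the first column before fitting leaves 12 coefficients.
clf12 = LogisticRegression(max_iter=1000).fit(X[:, 1:], y)
print(clf12.coef_.shape)  # (1, 12)

# Pairing coefficients with (hypothetical) column names makes a
# missing column easy to spot.
names = [f"feature_{i}" for i in range(X.shape[1])]
for name, weight in zip(names, clf.coef_[0]):
    print(f"{name}: {weight:+.3f}")
```

Note that coefficient magnitudes are only comparable across features when the features are on similar scales, so the printed weights should be read with that caveat.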

Any help here would be much appreciated!

Not sure how to edit my original question in a way that it would still make sense for future reference, so I'll post a minimal example here:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy

headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
    # data rows not included in the post
]

df = pandas.DataFrame(matrix, columns=headers)
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
df = df.apply(lambda x: d[x.name].transform(x))

testrows = []
trainrows = []
splitIndex = len(matrix) // 10  # the first 10% of the rows form the test set
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)

classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)

I think I may have found the source of the error (thanks @Alexey Trofimov for pointing me in the right direction). My code at first contained:

train_features = traindf.iloc[:,1:len(headers)-1]

This was copied from another script, where I did have IDs as the first column of my matrix and hence didn't want to take them into account. The len(headers)-1, if I understand things correctly, excludes the actual label. Testing this in a real-world scenario, deleting the -1 results in a perfect F-score, which makes sense, since the classifier could then just look at the actual label and always predict correctly. So I have now changed this to

train_features = traindf.iloc[:,0:len(headers)-1]

as in the code snippet, and now get 13 columns (in train_features.shape, and consequently in classifier.coef_). I think this solved my issue, but I am still not 100% convinced, so if someone could point out an error in this line of reasoning or in my code above, I'd be grateful to hear about it.
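To double-check the slicing, here is a small sketch that uses the same headers list but a single made-up row (the values are placeholders, only there to give the frame the right width). It compares the column counts that the two iloc slices produce.

```python
import pandas as pd

headers = ['phrase_type', 'type', 'complex_np', 'np_form', 'referentiality',
           'grammatical_role', 'ambiguity', 'anaphor_type', 'dir_speech',
           'length_of_span', 'length_of_coref_chain', 'position_in_coref_chain',
           'position_in_sentence', 'is_topic']

# One made-up row of integers; not real data from the question.
df = pd.DataFrame([list(range(len(headers)))], columns=headers)

# Old slice: starts at column 1, so 'phrase_type' is skipped as well.
with_skip = df.iloc[:, 1:len(headers) - 1]
# New slice: starts at column 0 and stops before the label column.
without_skip = df.iloc[:, 0:len(headers) - 1]

print(with_skip.shape[1])                   # 12
print(without_skip.shape[1])                # 13
print('is_topic' in without_skip.columns)   # False

# Without the -1 the label column leaks into the features.
leaky = df.iloc[:, 0:len(headers)]
print(leaky.columns[-1])                    # is_topic
```

An alternative that avoids index arithmetic would be df.drop(columns=['is_topic']), which selects the features by name rather than by position.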

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
