
Linear Regression - Get Feature Importance using MinMaxScaler() - Extremely large coefficients

I'm trying to get the feature importances for a regression model. I have 58 independent variables and one dependent variable. Most of the independent variables are numerical and some are binary.

First I used this:

from sklearn.linear_model import LinearRegression
from matplotlib import pyplot

X = dataset.drop(['y'], axis=1)
y = dataset[['y']]

# define the model
model = LinearRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_[0]
print(model.coef_)
print(importance)
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

and got the following results: Feature Importance Plot

Then I used MinMaxScaler() to scale the data before fitting the model:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
dataset[dataset.columns] = scaler.fit_transform(dataset[dataset.columns])
print(dataset)

X = dataset.drop(['y'], axis=1)
y = dataset[['y']]

# define the model
model = LinearRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_[0]
print(model.coef_)
print(importance)
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

which led to the following plot: Feature Importance Plot after using MinMaxScaler

As the 1e11 scale factor in the upper left corner shows, the largest coefficients are on the order of negative 60 billion. What am I doing wrong here? And is MinMaxScaler even the right approach to use?

In regression analysis, the magnitude of your coefficients is not necessarily related to their importance. The most common criteria to determine the importance of independent variables in regression analysis are p-values. Small p-values imply high levels of importance, whereas high p-values mean that a variable is not statistically significant. You should only use the magnitude of coefficients as a measure for feature importance when your model is penalizing variables. That is, when the optimization problem has L1 or L2 penalties, like lasso or ridge regressions.
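To illustrate the point about penalized models, here is a minimal sketch with synthetic data (the feature setup and alpha value are made up for the example): under an L1 penalty, coefficients of irrelevant variables are shrunk toward zero, so their magnitudes become a meaningful importance signal.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks irrelevant coefficients toward zero, so the
# absolute coefficients can be read as relative importances
model = Lasso(alpha=0.1)
model.fit(X, y)
importance = np.abs(model.coef_)
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
```

Run on this data, the irrelevant features come out with near-zero scores, unlike the plain LinearRegression coefficients.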

sklearn does not report p-values, though. I recommend running the same regression with statsmodels.OLS. For tree-based models and ensembles (random forests, gradient boosting, etc.), you can instead use the feature_importances_ attribute to determine the individual importance of each independent variable.

By using model.coef_ as a measure of feature importance, you are only taking into account the magnitude of the betas. If this really is what you are interested in, try numpy.abs(model.coef_[0]), because betas can be negative too.

As for your use of MinMaxScaler(), you are using it correctly. However, you are transforming the entire dataset, when really you are only supposed to re-scale your independent variables:

from sklearn.preprocessing import MinMaxScaler

X = dataset.drop(['y'], axis=1)
y = dataset['y']
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
print(X)

By using scaler.fit_transform(dataset[dataset.columns]) you were rescaling ALL the columns in your dataset object, including your dependent variable. In fact, your code is equivalent to scaler.fit_transform(dataset), as you were selecting all the columns in dataset.

Typically, you should only re-scale your data if you suspect that outliers are affecting your estimator. By re-scaling your data, the beta coefficients are no longer interpretable (or at least not as intuitive). This happens because a given beta no longer indicates the change in the dependent variable caused by a marginal change in the corresponding independent variable.

Finally, this should not be an issue, but just to be safe, make sure that the scaler is not changing your binary independent variables.
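You can check this directly: because MinMaxScaler maps each column's minimum to 0 and its maximum to 1, a 0/1 column that contains both values is left unchanged. A quick sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First column is binary (0/1), second is numeric
X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [0.0, 30.0],
              [1.0, 40.0]])

scaled = MinMaxScaler().fit_transform(X)
print(scaled)  # binary column unchanged; numeric column mapped to [0, 1]
```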
