
Python Regression Variable Selection

I have a basic linear regression with 80 numerical variables (no categorical variables). The training set has 1600 rows, the test set 700.

I would like a Python package that iterates through column combinations to find the best subset according to a custom score function, or an out-of-the-box criterion like AIC. If that doesn't exist, what do people here use for variable selection? I know R has some packages for this, but I don't want to deal with Rpy2.

I have no preference whether the linear model comes from scikit-learn, numpy, pandas, statsmodels, or something else.

I can suggest using the Least Absolute Shrinkage and Selection Operator (Lasso). I haven't used it in a situation like yours, with that many variables, though.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

I often write code to do linear regression with statsmodels like below:

import statsmodels.api as sm

# OLS takes the data in the constructor, not in fit()
model = sm.OLS(train_Y, train_X)
results = model.fit()

If I want to do Lasso regression, I write code like below:

from sklearn import linear_model

model = linear_model.Lasso(alpha=1.0)  # 1.0 is the default
results = model.fit(train_X, train_Y)

You have to decide on an appropriate alpha. It can be any non-negative value, not just between 0.0 and 1.0: a larger alpha applies a stronger penalty and forces more coefficients to exactly zero, which is what performs the variable selection.
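In practice, alpha is usually chosen by cross-validation rather than by hand; scikit-learn's LassoCV does this search for you. Here is a minimal sketch on synthetic data (your real train_X/train_Y would go in its place):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data for illustration: only feature 0 actually matters
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 5.0 * X[:, 0] + rng.normal(size=200)

# LassoCV tries a grid of alphas and keeps the one with the best CV score
model = LassoCV(cv=5).fit(X, y)

print(model.alpha_)                 # the chosen penalty strength
kept = np.flatnonzero(model.coef_)  # indices of features with nonzero coefficients
print(kept)
```

The nonzero entries of `coef_` are the selected variables, so this doubles as a variable-selection step.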

Try this.
