
How to do linear regression using Python and Scikit learn using one hot encoding?

I am trying to use linear regression in combination with Python and scikit-learn to answer the question "Can user session lengths be predicted given user demographic information?"

I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one-hot encoded all of my categorical variables, including gender, country, and age range.

I am not sure how to take my one-hot encoding into account, or if I even need to.

Input Data:

[screenshot of the input data table]

I tried reading here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

I understand that the main inputs are whether to calculate a fit intercept, whether to normalize, whether to copy X (all boolean), and then n_jobs.

I'm not sure what factors to take into account when deciding on these inputs. I'm also concerned about whether my one-hot encoding of the variables has an impact.

You can do it like this:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# X is a numpy array with your features
# y is the label array
enc = OneHotEncoder(sparse=False)
X_transform = enc.fit_transform(X)

# apply your linear regression as you want
model = LinearRegression()
model.fit(X_transform, y)

# in-sample mean squared error
print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))

Please note that in this example I am training and testing on the same dataset! This may cause your model to overfit. You should avoid that by splitting the data into training and test sets, or by doing cross-validation.
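For illustration, a minimal hold-out evaluation building on the snippet above might look like this (the 20% test fraction and the random seed are arbitrary choices):

from sklearn.model_selection import train_test_split

# hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_transform, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
# error on data the model has not seen during training
print("Test MSE: %.2f" % np.mean((model.predict(X_test) - y_test) ** 2))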

I just wanted to fit a linear regression with sklearn, which I use as a benchmark for other non-linear approaches, such as MLPRegressor, but also for variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this group: http://scikit-learn.org/stable/modules/linear_model.html ).

Doing it the same way as described by @silviomoreto (which worked for all other models) actually resulted in an erroneous model for me (very high errors). This is most likely due to the so-called dummy variable trap, which arises from multicollinearity when you include one dummy variable per category of a categorical variable -- which is exactly what OneHotEncoder does! See also the following discussion on Cross Validated: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn .

To avoid this, I wrote a simple wrapper that excludes one dummy column, which then acts as the default category.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DummyEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, n_values='auto'):
        # n_values is the pre-0.22 OneHotEncoder parameter
        self.n_values = n_values

    def transform(self, X):
        # one-hot encode, then drop the last column so it acts as the default
        ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
        return ohe.fit_transform(X)[:, :-1]

    def fit(self, X, y=None, **fit_params):
        return self

So building on the code of @silviomoreto, you would replace the line that constructs the encoder with:

enc = DummyEncoder()

This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.
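Since the wrapper is a scikit-learn transformer, it drops straight into a feature pipeline; a minimal sketch (the step names are my own):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# encode the categorical features, then fit ordinary least squares
pipe = Pipeline([
    ('encode', DummyEncoder()),
    ('regress', LinearRegression()),
])
pipe.fit(X, y)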

I chose this approach because I wanted to include it in my feature pipeline. But you seem to have the data already encoded. Here, you would have to drop one column per category (e.g. for male/female, only include one). So if you, for example, used pandas.get_dummies(...), this can be done with the parameter drop_first=True.
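For example (the column names below are assumptions based on the demographics mentioned in the question):

import pandas as pd

# hypothetical demographic frame
df = pd.DataFrame({'gender': ['male', 'female', 'female'],
                   'country': ['US', 'DE', 'US'],
                   'age_range': ['18-25', '26-35', '18-25']})

# drop_first=True keeps k-1 dummies per category, avoiding the dummy variable trap
X = pd.get_dummies(df, drop_first=True)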

Last but not least, if you really need to go deeper into linear regression in Python, rather than just use it as a benchmark, I would recommend statsmodels over scikit-learn ( https://pypi.python.org/pypi/statsmodels ), as it provides better model statistics, e.g. p-values per variable, etc.
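As a minimal sketch of what that looks like (note that statsmodels, unlike scikit-learn, does not add an intercept unless you ask for one):

import statsmodels.api as sm

# statsmodels does not add an intercept automatically
X_const = sm.add_constant(X_transform)
ols = sm.OLS(y, X_const).fit()
print(ols.summary())  # coefficients, p-values, confidence intervals, R^2, ...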

how to prepare data for sklearn LinearRegression

OneHotEncoder should only be applied to the intended columns: those holding categorical variables or strings, or integers that are really levels rather than numeric values.

DO NOT apply OneHotEncoder to your entire dataset, including numerical variables or Booleans.

To prepare the data for sklearn's LinearRegression, the numerical and categorical columns should be handled separately:

  • numerical columns: standardize them if your model contains interactions or polynomial terms
  • categorical columns: apply one-hot encoding, either through sklearn's OneHotEncoder or through pd.get_dummies. pd.get_dummies is more flexible, while OneHotEncoder is more consistent in working with the sklearn API. A sketch combining both steps follows this list.
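Here is that separate handling combined into one preprocessor with ColumnTransformer (available since sklearn 0.20; the column names are assumptions):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical column split for the question's data
numeric_cols = ['past_session_count']                 # assumed numeric feature
categorical_cols = ['gender', 'country', 'age_range']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols),
])

model = Pipeline([('prep', preprocess), ('ols', LinearRegression())])
# model.fit(df, y)  # df is a DataFrame containing the columns above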

drop='first'

As of version 0.22, OneHotEncoder in sklearn has a drop option. For example OneHotEncoder(drop='first').fit(X) , which is similar to pd.get_dummies(drop_first=True) .
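A small round-trip example:

from sklearn.preprocessing import OneHotEncoder

# two categorical features; drop='first' keeps k-1 columns per feature
X = [['male', 'US'], ['female', 'DE'], ['female', 'US']]
enc = OneHotEncoder(drop='first')
print(enc.fit_transform(X).toarray())  # output is sparse by default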

use regularized linear regression

If you use a regularized linear regression such as Lasso, multicollinear variables will be penalized and shrunk.
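A minimal sketch (the alpha value is an arbitrary starting point and should be tuned, e.g. with LassoCV):

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_transform, y)
print(lasso.coef_)  # coefficients of redundant dummy columns shrink toward zero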

limitation of p-value statistics

The p-values in OLS are only valid when the OLS assumptions are more or less true. While there are methods to deal with situations where p-values cannot be trusted, one potential solution is to use cross-validation or leave-one-out to gain confidence in the model.
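For example, a sketch of both options with sklearn's model selection utilities:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

model = LinearRegression()
# 5-fold cross-validated R^2 scores
print(cross_val_score(model, X_transform, y, cv=5))
# leave-one-out, scored by negative mean squared error
print(cross_val_score(model, X_transform, y, cv=LeaveOneOut(),
                      scoring='neg_mean_squared_error').mean())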
