Using Categorical Predictor Variables in scikit-learn

Question

Basic question here:

I'm trying to implement a simple classification model for credit card default where I just call model.fit and model.predict on my input data. However, that input data contains both categorical data (demographic information such as age, marital status, and education level) and continuous data (such as credit balances).

    data.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30000 entries, 1 to 30000
    Data columns (total 24 columns):
    LIMIT_BAL    30000 non-null float64
    SEX          30000 non-null int64
    EDUCATION    30000 non-null int64
    MARRIAGE     30000 non-null int64
    AGE          30000 non-null int64
    PAY_1        30000 non-null int64
    PAY_2        30000 non-null int64
    PAY_3        30000 non-null int64
    PAY_4        30000 non-null int64
    PAY_5        30000 non-null int64
    PAY_6        30000 non-null int64
    BILL_AMT1    30000 non-null float64
    BILL_AMT2    30000 non-null float64
    BILL_AMT3    30000 non-null float64
    BILL_AMT4    30000 non-null float64
    BILL_AMT5    30000 non-null float64
    BILL_AMT6    30000 non-null float64
    PAY_AMT1     30000 non-null float64
    PAY_AMT2     30000 non-null float64
    PAY_AMT3     30000 non-null float64
    PAY_AMT4     30000 non-null float64
    PAY_AMT5     30000 non-null float64
    PAY_AMT6     30000 non-null float64
    default      30000 non-null int64
    dtypes: float64(13), int64(11)
    memory usage: 5.7 MB

From my understanding, scikit-learn requires all data to be either numerical and continuous or explicitly encoded as categorical. The numerical part is not a problem, since all of my data is coded numerically (like 0 for Married, 1 for not), but three of my variables (SEX, EDUCATION, and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables rather than treated as plain int64 values.

How do I encode these three variables with scikit-learn's preprocessing module so that I can properly feed them into a model like Logistic Regression?

Thanks in advance, and please forgive the formatting (feel free to edit or recommend how I can properly include Jupyter Notebook output into a Stack Overflow post).

Answer 1

Categorical features need extra attention during feature engineering, because features like age and dates are difficult to encode. There are many ways to encode them, for example based on exploratory analysis or on domain knowledge.

There is a library, category_encoders, which provides many encoders for such features, including ones based on statistics. You can find more here: http://contrib.scikit-learn.org/categorical-encoding/

Here is another good resource that shows how to use an encoding method with an example.
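As a rough illustration of one common option using only scikit-learn, the sketch below one-hot encodes SEX, EDUCATION, and MARRIAGE with OneHotEncoder inside a ColumnTransformer and feeds the result to LogisticRegression. The data frame here is a small synthetic stand-in with made-up values (only a few of the question's columns are included), and the choices of StandardScaler and max_iter=1000 are assumptions for the example, not part of the question or of the resources above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the credit-default data; all values are random.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "SEX": rng.integers(1, 3, n),         # integer-coded category
    "EDUCATION": rng.integers(1, 5, n),   # integer-coded category
    "MARRIAGE": rng.integers(1, 4, n),    # integer-coded category
    "AGE": rng.integers(21, 70, n),
    "LIMIT_BAL": rng.uniform(1e4, 5e5, n),
    "default": rng.integers(0, 2, n),
})

categorical = ["SEX", "EDUCATION", "MARRIAGE"]
numeric = ["AGE", "LIMIT_BAL"]

# One-hot encode the categorical columns and scale the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Chain preprocessing and the classifier so fit/predict take the raw frame.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = data.drop(columns="default")
y = data["default"]
model.fit(X, y)
predictions = model.predict(X)
```

Wrapping the encoder in a Pipeline means the same encoding learned in fit is reapplied in predict, and handle_unknown="ignore" keeps prediction from failing on a category code that was not seen during training. If EDUCATION is better treated as ordered, an ordinal encoding may be worth comparing against the one-hot version.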

1 answers

solution1
2 ACCPTED 2019-01-11 05:21:26