
Using Categorical Predictor Variables in scikit-learn

Basic question here:

I'm trying to implement a simple classification model for credit card default where I just call model.fit and model.predict on my input data. However, that input data contains both categorical data (demographic information such as Age, Married or Not, Education level) and continuous data (such as credit balances).

data.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30000 entries, 1 to 30000
    Data columns (total 24 columns):
    LIMIT_BAL    30000 non-null float64
    SEX          30000 non-null int64
    EDUCATION    30000 non-null int64
    MARRIAGE     30000 non-null int64
    AGE          30000 non-null int64
    PAY_1        30000 non-null int64
    PAY_2        30000 non-null int64
    PAY_3        30000 non-null int64
    PAY_4        30000 non-null int64
    PAY_5        30000 non-null int64
    PAY_6        30000 non-null int64
    BILL_AMT1    30000 non-null float64
    BILL_AMT2    30000 non-null float64
    BILL_AMT3    30000 non-null float64
    BILL_AMT4    30000 non-null float64
    BILL_AMT5    30000 non-null float64
    BILL_AMT6    30000 non-null float64
    PAY_AMT1     30000 non-null float64
    PAY_AMT2     30000 non-null float64
    PAY_AMT3     30000 non-null float64
    PAY_AMT4     30000 non-null float64
    PAY_AMT5     30000 non-null float64
    PAY_AMT6     30000 non-null float64
    default      30000 non-null int64
    dtypes: float64(13), int64(11)
    memory usage: 5.7 MB

From my understanding, scikit-learn requires all data to be either numerical and continuous or explicitly encoded as a categorical variable. The numerical part is not a problem, since all of my data is coded numerically (like 0 for Married, 1 for not), but three of my variables (SEX, EDUCATION, and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables rather than treated as plain int64 values.

How do I encode these three variables with scikit-learn's preprocessing module so that I can properly feed these features into a model like Logistic Regression?

Thanks in advance, and please forgive the formatting (feel free to edit or recommend how I can properly include Jupyter Notebook output into a Stack Overflow post).

Categorical features need extra attention during feature engineering, because features such as age and dates are difficult to encode. There are many ways to encode them, and the right choice usually depends on analyzing the data and on domain knowledge.

There is a library, category_encoders, which offers many statistics-based methods for encoding such features. You can find more here: http://contrib.scikit-learn.org/categorical-encoding/
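To give a rough idea of what encoding "by the use of statistics" can mean, here is a minimal hand-rolled sketch of mean (target) encoding in plain pandas. It is not the category_encoders API, only an illustration of the idea; the toy values are made up, and the column names are borrowed from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "EDUCATION": [1, 2, 2, 3, 1, 2],
    "default":   [0, 1, 0, 1, 0, 1],
})

# The statistic used for encoding: mean of the target within each category.
category_means = df.groupby("EDUCATION")["default"].mean()

# Replace each category code with the default rate observed for that category.
df["EDUCATION_encoded"] = df["EDUCATION"].map(category_means)
print(df)
```

On real data you would compute these means on the training split only and then apply them to the test split; otherwise the target leaks into the features.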

Here is another good resource that shows how to use these encoding methods through an example.
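For the three columns in the question, scikit-learn's own preprocessing module may already be enough. Below is a minimal sketch, assuming scikit-learn 0.20 or later (for ColumnTransformer) and assuming SEX, EDUCATION and MARRIAGE are treated as nominal and one-hot encoded. The DataFrame is synthetic stand-in data with only a few of the question's columns, and scaling the continuous columns is my own choice, not something the question requires.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the question's DataFrame (values are random).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "SEX": rng.integers(1, 3, n),          # codes 1..2
    "EDUCATION": rng.integers(1, 5, n),    # codes 1..4
    "MARRIAGE": rng.integers(1, 4, n),     # codes 1..3
    "LIMIT_BAL": rng.uniform(10_000, 500_000, n),
    "BILL_AMT1": rng.uniform(0, 100_000, n),
    "default": rng.integers(0, 2, n),
})

categorical = ["SEX", "EDUCATION", "MARRIAGE"]
numeric = ["LIMIT_BAL", "BILL_AMT1"]

# One-hot encode the integer-coded categories; standardize the continuous columns.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Chain preprocessing and the classifier so fit/predict apply both steps.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = data[categorical + numeric]
y = data["default"]
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:10])
```

Because the encoder lives inside the pipeline, model.fit and model.predict can be called on the raw DataFrame, which is close to the workflow described in the question. If EDUCATION is better treated as ordinal, leaving it as a single integer column is another option.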


 