
Using Categorical Predictor Variables in scikit-learn

Basic question here:

I'm trying to implement a simple classification model for credit card default where I just call model.fit and model.predict on my input data. However, that input data contains both categorical data (demographic information such as age, marital status, and education level) and continuous data (such as credit balances).

data.info()

<pre>&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
LIMIT_BAL    30000 non-null float64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_1        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null float64
BILL_AMT2    30000 non-null float64
BILL_AMT3    30000 non-null float64
BILL_AMT4    30000 non-null float64
BILL_AMT5    30000 non-null float64
BILL_AMT6    30000 non-null float64
PAY_AMT1     30000 non-null float64
PAY_AMT2     30000 non-null float64
PAY_AMT3     30000 non-null float64
PAY_AMT4     30000 non-null float64
PAY_AMT5     30000 non-null float64
PAY_AMT6     30000 non-null float64
default      30000 non-null int64
dtypes: float64(13), int64(11)
memory usage: 5.7 MB
</pre>

From my understanding, scikit-learn requires all data to be numeric, and either continuous or explicitly encoded as a categorical variable. The numeric part is not a problem, since all of my data is coded numerically (for example, 0 for married, 1 for not), but three of my variables (SEX, EDUCATION, and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables rather than plain int64 columns.

How do I encode these three variables with scikit-learn's preprocessing module so that I can properly feed these features into a model like Logistic Regression?

Thanks in advance, and please forgive the formatting (feel free to edit or recommend how I can properly include Jupyter Notebook output in a Stack Overflow post).

Categorical features need extra attention during feature engineering, because features such as age or dates are not straightforward to encode. There are many ways to encode them, ranging from choices based on analysis of the data to schemes guided by domain knowledge.
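One of those ways, using scikit-learn's own preprocessing module as the question asks, is to one-hot encode the three coded columns and pass the continuous ones through a scaler. The following is only a minimal sketch under stated assumptions: the DataFrame is a small synthetic stand-in that reuses a few of the column names from the question (the values are made up), and treating SEX, EDUCATION, and MARRIAGE as purely nominal is a modelling choice, not something the question settles.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the real data: column names follow the question,
# but every value below is synthetic.
rng = np.random.default_rng(0)
n = 200
idx = np.arange(n)
data = pd.DataFrame({
    "LIMIT_BAL": rng.uniform(10_000, 500_000, n),
    "SEX": idx % 2 + 1,          # codes 1..2
    "EDUCATION": idx % 4 + 1,    # codes 1..4
    "MARRIAGE": idx % 3 + 1,     # codes 1..3
    "AGE": rng.integers(21, 70, n),
    "default": (idx % 5 == 0).astype(int),
})

categorical_cols = ["SEX", "EDUCATION", "MARRIAGE"]
numeric_cols = ["LIMIT_BAL", "AGE"]

# One-hot encode the integer-coded categoricals; standardize the numeric columns.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

# Chaining preprocessing and the classifier keeps fit/predict a single call each.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = data.drop(columns="default")
y = data["default"]
model.fit(X, y)
predictions = model.predict(X)
```

With this arrangement, model.fit and model.predict are still the only calls needed on the raw DataFrame. If EDUCATION is better treated as ordinal, leaving it as an integer column (or using OrdinalEncoder) is an alternative worth comparing.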

The category_encoders library provides many encoders for such features, several of them based on statistics of the data. You can find more here: http://contrib.scikit-learn.org/categorical-encoding/

Here is another good resource that walks through the use of an encoding method with an example.


 