Using categorical predictors in scikit-learn

Question

Basic question here:

I'm trying to implement a simple classification model for credit card default where I just call model.fit and model.predict on my input data. However, that input data contains both categorical data (demographic information such as Age, Married or Not, and Education level) and continuous data (such as credit balances).

data.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30000 entries, 1 to 30000
    Data columns (total 24 columns):
    LIMIT_BAL    30000 non-null float64
    SEX          30000 non-null int64
    EDUCATION    30000 non-null int64
    MARRIAGE     30000 non-null int64
    AGE          30000 non-null int64
    PAY_1        30000 non-null int64
    PAY_2        30000 non-null int64
    PAY_3        30000 non-null int64
    PAY_4        30000 non-null int64
    PAY_5        30000 non-null int64
    PAY_6        30000 non-null int64
    BILL_AMT1    30000 non-null float64
    BILL_AMT2    30000 non-null float64
    BILL_AMT3    30000 non-null float64
    BILL_AMT4    30000 non-null float64
    BILL_AMT5    30000 non-null float64
    BILL_AMT6    30000 non-null float64
    PAY_AMT1     30000 non-null float64
    PAY_AMT2     30000 non-null float64
    PAY_AMT3     30000 non-null float64
    PAY_AMT4     30000 non-null float64
    PAY_AMT5     30000 non-null float64
    PAY_AMT6     30000 non-null float64
    default      30000 non-null int64
    dtypes: float64(13), int64(11)
    memory usage: 5.7 MB

From my understanding, scikit-learn requires all data to be numerical and continuous, or explicitly encoded as a categorical variable. The numerical part is not a problem, since all of my data is coded numerically (e.g. 0 for Married, 1 for not), but 3 of my variables (SEX, EDUCATION, and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables rather than treated as int64 values.

How do I encode these 3 variables with scikit-learn's preprocessing module so that I can properly feed these features into a model like Logistic Regression?

Thanks in advance, and please forgive the formatting (feel free to edit or recommend how I can properly include Jupyter Notebook output in a Stack Overflow post).

Answer 1

Categorical features need extra attention during feature engineering, because features such as age or dates are difficult to encode. There are many ways to encode them, for example based on exploratory analysis of the data or on domain knowledge.
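
One common way to do this with scikit-learn itself is to one-hot encode the nominal columns and scale the continuous ones inside a single pipeline. The following is a minimal sketch, not a definitive recipe: the DataFrame is a small synthetic stand-in with made-up values, only a few of the original columns are included, and the choice of OneHotEncoder plus StandardScaler is an assumption about what suits this data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the credit-default data (values are invented).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "SEX": rng.integers(1, 3, n),         # integer codes 1..2
    "EDUCATION": rng.integers(1, 5, n),   # integer codes 1..4
    "MARRIAGE": rng.integers(1, 4, n),    # integer codes 1..3
    "AGE": rng.integers(21, 70, n),
    "LIMIT_BAL": rng.uniform(10_000, 500_000, n),
    "default": rng.integers(0, 2, n),
})

categorical_cols = ["SEX", "EDUCATION", "MARRIAGE"]
numeric_cols = ["AGE", "LIMIT_BAL"]

# One-hot encode the integer-coded categories; standardize the numeric columns.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = data.drop(columns="default")
y = data["default"]
model.fit(X, y)
predictions = model.predict(X)
```

Keeping the encoder inside the pipeline means the same category-to-column mapping learned in fit is reused in predict, so new data does not need to be encoded by hand.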

The category_encoders library provides many encoders for such features, several of them based on statistics of the data. You can find more at http://contrib.scikit-learn.org/categorical-encoding/

Here is another good resource, which shows the use of an encoding method through an example.
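
As a small illustration of what one encoding method produces (this example is not taken from the linked resource), pandas can expand integer-coded categories into dummy columns. The toy values are invented, and dropping the first level of each variable is an assumption made here to avoid redundant columns in a linear model such as logistic regression.

```python
import pandas as pd

# Toy rows with integer-coded categories (values are invented).
df = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 2],
    "MARRIAGE": [1, 2, 1, 3],
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
})

# Replace each listed column with indicator columns, one per category level,
# dropping the first level of each so the indicators are not collinear.
encoded = pd.get_dummies(df, columns=["EDUCATION", "MARRIAGE"], drop_first=True)
print(encoded.columns.tolist())
```

The resulting frame keeps LIMIT_BAL unchanged and can be passed to model.fit like any other numeric input.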


Solution 1
2 Accepted 2019-01-11 05:21:26
