Using categorical predictors in scikit-learn

Question

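Basic question here:

I'm trying to implement a simple classification model for credit card default, in which I only call model.fit and model.predict on the input data. However, the input data contains both categorical data (e.g. demographic information such as age, marital status, education level) and continuous data (e.g. credit balance).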

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
LIMIT_BAL    30000 non-null float64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_1        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null float64
BILL_AMT2    30000 non-null float64
BILL_AMT3    30000 non-null float64
BILL_AMT4    30000 non-null float64
BILL_AMT5    30000 non-null float64
BILL_AMT6    30000 non-null float64
PAY_AMT1     30000 non-null float64
PAY_AMT2     30000 non-null float64
PAY_AMT3     30000 non-null float64
PAY_AMT4     30000 non-null float64
PAY_AMT5     30000 non-null float64
PAY_AMT6     30000 non-null float64
default      30000 non-null int64
dtypes: float64(13), int64(11)
memory usage: 5.7 MB

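As I understand it, scikit-learn requires all data to be either numeric and continuous, or explicitly encoded as categorical variables. The numeric part is not a problem, since all of my data is numerically encoded (e.g. 0 for married, 1 for not married), but three of my variables (SEX, EDUCATION and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables rather than int64.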

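How can I use scikit-learn's preprocessing module to encode these 3 variables so that the features are fed correctly into a model (such as logistic regression)?

Thanks in advance, and please excuse the formatting (feel free to edit it or suggest how to properly include Jupyter Notebook output in a Stack Overflow post).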

Answer 1

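Categorical features need extra attention during feature engineering, because features such as age, dates, etc. are hard to encode. There are many ways to encode these features, drawing on analysis, domain knowledge, and so on.

There is a category_encoders library that provides many encoders, including ones that use statistics of the data to encode these features. You can find more information here: http://contrib.scikit-learn.org/categorical-encoding/

Here is another good resource that shows how to use the encoding methods through examples.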
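For the specific case in the question, one possible approach with scikit-learn's own preprocessing module is one-hot encoding. The sketch below is not taken from the original answer: it assumes a scikit-learn version that provides ColumnTransformer, uses a tiny made-up DataFrame in place of the real 30000-row data, and treats only LIMIT_BAL and AGE as the continuous columns for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the credit card data; the values are invented for illustration.
data = pd.DataFrame({
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0, 500000.0, 100000.0],
    "SEX": [2, 2, 2, 1, 1, 2],
    "EDUCATION": [2, 2, 1, 3, 1, 2],
    "MARRIAGE": [1, 2, 2, 1, 2, 1],
    "AGE": [24, 26, 34, 57, 29, 23],
    "default": [1, 1, 0, 0, 0, 1],
})

categorical = ["SEX", "EDUCATION", "MARRIAGE"]  # integer codes that are really categories
numeric = ["LIMIT_BAL", "AGE"]                  # continuous columns (subset, by assumption)

# One-hot encode the categorical columns and standardize the continuous ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Chain preprocessing and the classifier so fit/predict apply both steps.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X = data.drop(columns="default")
y = data["default"]

model.fit(X, y)
predictions = model.predict(X)
```

Wrapping the encoder in a Pipeline keeps the model.fit / model.predict workflow from the question unchanged, and handle_unknown="ignore" avoids errors if a category code shows up at prediction time that was not present during fitting. Whether one-hot encoding is the right choice for an ordinal variable such as EDUCATION is a modelling decision; this is only one option.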

Solution 1
2 Accepted 2019-01-11 05:21:26