Using categorical predictors in scikit-learn

Question

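Basic question here:

I am trying to implement a simple classification model for credit card default, in which I only call model.fit and model.predict on the input data. However, the input data contains both categorical data (demographic information such as age, married or unmarried, and education level) and continuous data (such as credit balance).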

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
LIMIT_BAL    30000 non-null float64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_1        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null float64
BILL_AMT2    30000 non-null float64
BILL_AMT3    30000 non-null float64
BILL_AMT4    30000 non-null float64
BILL_AMT5    30000 non-null float64
BILL_AMT6    30000 non-null float64
PAY_AMT1     30000 non-null float64
PAY_AMT2     30000 non-null float64
PAY_AMT3     30000 non-null float64
PAY_AMT4     30000 non-null float64
PAY_AMT5     30000 non-null float64
PAY_AMT6     30000 non-null float64
default      30000 non-null int64
dtypes: float64(13), int64(11)
memory usage: 5.7 MB

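From my understanding, scikit-learn requires all data to be either numeric and continuous, or specifically encoded as categorical variables. The numeric part is not an issue, since all of my data are numerically encoded (e.g. 0 for married, 1 for not married), but 3 of my variables (SEX, EDUCATION and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables rather than int64.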

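How can I use scikit-learn's preprocessing module to encode these 3 variables so that these features are fed correctly into a model (such as logistic regression)?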

Thanks in advance, and please excuse the formatting (feel free to edit, or suggest how I can properly include Jupyter Notebook output in a Stack Overflow post).

Answer 1

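Categorical features need extra attention during feature engineering, because features such as age, dates, etc. are hard to encode. There are many ways to encode such features, drawing on analysis, domain knowledge, and so on.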
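For the three columns named in the question, one common option is one-hot encoding inside a scikit-learn pipeline. The sketch below is only an illustration under stated assumptions: the small DataFrame is made up and merely reuses a few of the question's column names, and the choices of StandardScaler for the numeric columns and max_iter=1000 are mine, not something the question specifies.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that only mimic the question's column names (assumption).
data = pd.DataFrame({
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0, 50000.0, 500000.0],
    "SEX": [2, 2, 2, 2, 1, 1],
    "EDUCATION": [2, 2, 2, 2, 2, 1],
    "MARRIAGE": [1, 2, 2, 1, 1, 2],
    "AGE": [24, 26, 34, 37, 57, 29],
    "default": [1, 1, 0, 0, 0, 0],
})

categorical = ["SEX", "EDUCATION", "MARRIAGE"]  # integer codes treated as categories
numeric = ["LIMIT_BAL", "AGE"]                  # left as continuous values

# One-hot encode the categorical columns and scale the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Chaining the steps means fit/predict apply the same encoding every time.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = data.drop(columns="default")
y = data["default"]
model.fit(X, y)
predictions = model.predict(X)
```

Keeping the encoder inside the pipeline means the categories learned during fit are reused at predict time, and handle_unknown="ignore" encodes a category not seen during fit as all zeros instead of raising an error.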

There is a category_encoders library that provides many functions for encoding these features using statistical information. You can find more information here: http://contrib.scikit-learn.org/categorical-encoding/

Here is another good resource that shows how to use the encoding methods, with examples.
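To give a rough idea of what "encoding with statistical information" can mean, here is a hand-written mean (target) encoding sketch in plain pandas. It is not the category_encoders API; the helper mean_encode and the toy data are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy training data (made up): one categorical column and a binary target.
train = pd.DataFrame({
    "EDUCATION": [1, 1, 2, 2, 2, 3],
    "default":   [0, 1, 0, 0, 1, 1],
})

# Mean of the target per category, learned from the training rows only.
category_means = train.groupby("EDUCATION")["default"].mean()
global_mean = train["default"].mean()


def mean_encode(values, means, fallback):
    """Replace each category with its training-set target mean.

    Categories that were never seen in training get the fallback value.
    """
    return values.map(means).fillna(fallback)


train["EDUCATION_encoded"] = mean_encode(
    train["EDUCATION"], category_means, global_mean
)

# New rows, including a category (4) that is absent from the training data.
new_rows = pd.Series([1, 3, 4], name="EDUCATION")
encoded_new = mean_encode(new_rows, category_means, global_mean)
```

Because the encoding is derived from the target, it should be fit on training data only; computing it on the full dataset would leak label information into the features.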

Answer 1: score 2, accepted, 2019-01-11 05:21:26