How to apply dummy coding (pd.get_dummies()) only to categories whose share in a nominal variable is at least 40% in Python Pandas?
How to create dummy columns only for variables with an appropriate number of categories and a sufficient category share in the column?
I have a DataFrame in Python Pandas like the one below, with two column data types (numeric and object):
COL1 | COL2 | COL3 | ... | COLn
-----|------|------|-----|------
111 | A | Y | ... | ...
222 | A | Y | ... | ...
333 | B | Z | ... | ...
444 | C | Z | ... | ...
555 | D | P | ... | ...
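I need to dummy-code (pandas.get_dummies()) only the categorical variables that meet the following conditions:
1. the variable has at most 3 categories
2. a category's share of the variable is at least 40%
So, for example:
1. COL2 has 4 distinct categories (A, B, C, D), so it is dropped
2. in COL3, category P has a share of only 1/5 = 20%, so only Y and Z are dummy-coded
So, as a result, I need something like the following: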
COL1 | COL3_Y | COL3_Z | ... | COLn
-----|--------|--------|------|------
111 | 1 | 0 | ... | ...
222 | 1 | 0 | ... | ...
333 | 0 | 1 | ... | ...
444 | 0 | 1 | ... | ...
555 | 0 | 0 | ... | ...
With the following toy dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)
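Before filtering, it can help to look at the two quantities the conditions depend on: the number of distinct categories per column and each category's share of the rows. The following is only a sketch on the toy dataframe; the names `cat_cols`, `n_categories`, and `shares` are introduced here for illustration and do not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)

# Non-numeric columns are treated as categorical here (an assumption)
cat_cols = df.select_dtypes(exclude="number").columns

# Number of distinct categories in each categorical column
n_categories = df[cat_cols].nunique()

# Share of each category within its column, as fractions that sum to 1
shares = {col: df[col].value_counts(normalize=True) for col in cat_cols}

print(n_categories)
print(shares["COL3"])
```

On this data COL2 has 4 categories, so it fails the first condition, and category P in COL3 covers only one row in five, so it fails the second.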
Here is one way to do it:
# Setup
new_df = df["COL1"]
s = df.nunique()
# Filter out columns with too many categories
tmp = df.loc[:, s[s <= 3].index]
# Filter out values with insufficient percentage,
# then get dummies and concat new columns
for col in tmp.columns:
    frq = tmp[col].value_counts() / tmp.shape[0]
    other_tmp = tmp[col]
    other_tmp = other_tmp[other_tmp.isin(frq[frq >= 0.4].index)]
    other_tmp = pd.get_dummies(other_tmp)
    new_df = pd.concat([new_df, other_tmp], axis=1)
# Cleanup: rows whose category was filtered out have NaN dummies
new_df = new_df.fillna(0).astype(int)
Then:
print(new_df)
# Output
   COL1  Y  Z  U  V
0   111  1  0  1  0
1   222  1  0  1  0
2   333  0  1  0  0
3   444  0  1  0  1
4   555  0  0  0  1
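The columns above are named after the bare category values (Y, Z, U, V), while the question asks for names such as COL3_Y. One possible variant, sketched here under assumptions, passes `prefix=col` to `pd.get_dummies()` and masks rare categories with `Series.where` instead of dropping rows. The helper name `filtered_dummies` and its parameters `id_col`, `max_categories`, and `min_share` are hypothetical, not part of the original answer.

```python
import pandas as pd


def filtered_dummies(df, id_col, max_categories=3, min_share=0.4):
    """Hypothetical helper: dummy-code only frequent categories of small columns."""
    parts = [df[[id_col]]]
    for col in df.columns.drop(id_col):
        # Skip columns with too many distinct categories
        if df[col].nunique() > max_categories:
            continue
        # Keep only categories whose share of rows reaches the threshold
        shares = df[col].value_counts(normalize=True)
        keep = shares[shares >= min_share].index
        # Rare categories become NaN, which get_dummies encodes as all zeros
        kept = df[col].where(df[col].isin(keep))
        parts.append(pd.get_dummies(kept, prefix=col, dtype=int))
    return pd.concat(parts, axis=1)


df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)

result = filtered_dummies(df, "COL1")
print(result)
```

Because every row keeps its position, no `fillna` step is needed afterwards. Like the loop above, this sketch applies the category-count test to every column except `id_col`, so a numeric column with few distinct values would also be dummy-coded.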