How to create dummy columns (pd.get_dummies()) only for categorical variables with an appropriate number of categories and a sufficient category share (at least 40%) in Python Pandas?
I have a DataFrame in Python Pandas like the one below (two types of columns: numeric and object):
Data types: COL1 is numeric; COL2 and COL3 are object.
COL1 | COL2 | COL3 | ... | COLn |
---|---|---|---|---|
111 | A | Y | ... | ... |
222 | A | Y | ... | ... |
333 | B | Z | ... | ... |
444 | C | Z | ... | ... |
555 | D | P | ... | ... |
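I need to apply dummy coding (pandas.get_dummies()) only to categorical variables with the following characteristics:
- the variable has at most 3 categories
- the share of the category in the variable is at least 40%

So, as a result, I need something like the following: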
COL1 | COL3_Y | COL3_Z | ... | COLn
-----|--------|--------|------|------
111 | 1 | 0 | ... | ...
222 | 1 | 0 | ... | ...
333 | 0 | 1 | ... | ...
444 | 0 | 1 | ... | ...
555 | 0 | 0 | ... | ...
With the following toy dataframe:

    import pandas as pd

    df = pd.DataFrame(
        {
            "COL1": [111, 222, 333, 444, 555],
            "COL2": ["A", "A", "B", "C", "D"],
            "COL3": ["Y", "Y", "Z", "Z", "P"],
            "COL4": ["U", "U", "W", "V", "V"],
        }
    )

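Before encoding, it may help to check which columns and categories would pass the two thresholds. The following is only a minimal sketch, assuming the toy dataframe above and the thresholds used in this thread (at most 3 categories per column, category share of at least 40%); the names `n_categories` and `shares` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)

# Number of distinct values in each column
n_categories = df.nunique()

# Share of rows taken by each category, for every non-numeric column
shares = {
    col: df[col].value_counts(normalize=True)
    for col in df.select_dtypes(exclude="number").columns
}

print(n_categories)
for col, share in shares.items():
    print(col, share.to_dict())
```

With this data, COL2 has 4 distinct values and is therefore skipped, while COL3 and COL4 each have 3, and only two categories in each of them reach a 40% share.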
Here is one way to do it:

    # Setup
    new_df = df["COL1"]
    s = df.nunique()

    # Filter out columns with too many categories
    tmp = df.loc[:, s[s <= 3].index]

    # Filter out values with insufficient percentage
    # Get dummies and concat new columns
    for col in tmp.columns:
        frq = tmp[col].value_counts() / tmp.shape[0]
        other_tmp = tmp[col]
        other_tmp = other_tmp[
            other_tmp.isin(frq[frq >= 0.4].index.get_level_values(0).tolist())
        ]
        other_tmp = pd.get_dummies(other_tmp)
        new_df = pd.concat([new_df, other_tmp], axis=1)

    # Cleanup: rows dropped by the filter are NaN after concat, so set them to 0
    new_df = new_df.fillna(0).astype(int)

Then:

    print(new_df)
    # Output
       COL1  Y  Z  U  V
    0   111  1  0  1  0
    1   222  1  0  1  0
    2   333  0  1  0  0
    3   444  0  1  0  1
    4   555  0  0  0  1

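The output above uses bare category names (Y, Z, U, V), whereas the question asks for names such as COL3_Y. One possible variant, sketched here under the same thresholds, passes `prefix=col` to `pd.get_dummies` and masks infrequent categories with `Series.where` instead of dropping rows. The helper name `dummies_for_frequent_categories` and its parameters are hypothetical, chosen only for illustration.

```python
import pandas as pd


def dummies_for_frequent_categories(df, id_col="COL1", max_categories=3, min_share=0.4):
    """Hypothetical helper: dummy-encode only sufficiently frequent categories
    of columns that have at most `max_categories` distinct values."""
    parts = [df[[id_col]]]
    for col in df.columns.drop(id_col):
        # Skip columns with too many categories
        if df[col].nunique() > max_categories:
            continue
        # Relative frequency of each category in this column
        share = df[col].value_counts(normalize=True)
        keep = share[share >= min_share].index
        if keep.empty:
            continue
        # Categories outside `keep` become NaN; get_dummies ignores NaN by
        # default, so those rows are all zeros in the resulting dummy columns
        filtered = df[col].where(df[col].isin(keep))
        parts.append(pd.get_dummies(filtered, prefix=col, dtype=int))
    return pd.concat(parts, axis=1)


df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)

result = dummies_for_frequent_categories(df)
print(result)
```

For the toy dataframe this yields the columns COL1, COL3_Y, COL3_Z, COL4_U and COL4_V. Because no rows are removed before encoding, the final `fillna(0).astype(int)` cleanup step is not needed in this variant.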