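How to create dummy columns only for variables with an appropriate number of categories and a sufficient share of each category in the column?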

Question

I have a DataFrame in Python Pandas like the one below (two types of columns: numeric and object):

Data types:

COL1 - numeric
COL2 - object
COL3 - object

COL1	COL2	COL3	...	COLn
111	A	Y	...	...
222	A	Y	...	...
333	B	Z	...	...
444	C	Z	...	...
555	D	P	...	...

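I need to dummy-encode (pandas.get_dummies()) only the categorical variables that meet the following requirements:

1. At most 3 categories in the variable
2. Each category accounts for at least 0.4 of the values in the variable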

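So, for example:

COL2 does not meet requirement no. 1 (it has 4 different categories: A, B, C, D), so drop it
In COL3, category "P" does not meet requirement no. 2 (its share is 1/5 = 0.2), so use only categories "Y" and "Z" for dummy coding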
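
The two checks in this example can be reproduced with a few lines of pandas. This is only a minimal sketch: the toy DataFrame below mirrors the table above and is an assumption about the real data, and the names cat_cols, n_categories and shares are made up for illustration.

```python
import pandas as pd

# Toy data mirroring the table in the question (an assumption about the real data).
df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
    }
)

# Columns are either numeric or object, so the non-numeric ones are the
# candidates for dummy coding.
cat_cols = df.select_dtypes(exclude="number").columns

# Requirement no. 1: number of distinct categories per column.
n_categories = df[cat_cols].nunique()

# Requirement no. 2: share of each category within its column.
shares = {col: df[col].value_counts(normalize=True) for col in cat_cols}

print(n_categories)
print(shares["COL3"])
```

Here n_categories reports 4 for COL2 and 3 for COL3, and shares["COL3"] gives 0.4 for "Y" and "Z" and 0.2 for "P".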

So, as a result, I need something like this:

COL1 | COL3_Y | COL3_Z | ...  | COLn
-----|--------|--------|------|------
111  | 1      | 0      | ...  | ...
222  | 1      | 0      | ...  | ...
333  | 0      | 1      | ...  | ...
444  | 0      | 1      | ...  | ...
555  | 0      | 0      | ...  | ...

Answer 1

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)

Here is one way to do it:

# Setup: start from the numeric column and count distinct values per column
new_df = df["COL1"]
s = df.nunique()

# Keep only columns with at most 3 distinct values
# (this also drops COL1 here, since it has 5 distinct values)
tmp = df.loc[:, s[s <= 3].index]

# For each remaining column, drop values whose share is below 0.4,
# then get dummies and concat the new columns
for col in tmp.columns:
    frq = tmp[col].value_counts() / tmp.shape[0]
    other_tmp = tmp[col]
    other_tmp = other_tmp[other_tmp.isin(frq[frq >= 0.4].index)]
    other_tmp = pd.get_dummies(other_tmp)
    new_df = pd.concat([new_df, other_tmp], axis=1)

# Cleanup: rows whose value was filtered out are NaN after concat; set them to 0
new_df = new_df.fillna(0).astype(int)

Then:

print(new_df)
# Output
   COL1  Y  Z  U  V
0   111  1  0  1  0
1   222  1  0  1  0
2   333  0  1  0  0
3   444  0  1  0  1
4   555  0  0  0  1
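
The output above names the dummy columns after the category only (Y, Z, ...), while the question asks for names such as COL3_Y. One possible variation is sketched below: it passes prefix= to pd.get_dummies and only considers non-numeric columns, so a numeric column with few distinct values would not be encoded by accident. The helper name filtered_dummies and its parameters max_categories and min_share are hypothetical, not part of pandas.

```python
import pandas as pd


def filtered_dummies(df, max_categories=3, min_share=0.4):
    """Dummy-encode non-numeric columns that pass both filters (hypothetical helper)."""
    # Numeric columns are passed through unchanged.
    parts = [df.select_dtypes(include="number")]
    for col in df.select_dtypes(exclude="number").columns:
        # Skip columns with too many categories.
        if df[col].nunique() > max_categories:
            continue
        # Keep only categories whose share reaches the threshold.
        shares = df[col].value_counts(normalize=True)
        keep = shares[shares >= min_share].index
        if len(keep) == 0:
            continue
        # Values outside `keep` become NaN, so their rows get all-zero dummies.
        kept_values = df[col].where(df[col].isin(keep))
        parts.append(pd.get_dummies(kept_values, prefix=col, dtype=int))
    return pd.concat(parts, axis=1)


# Same toy dataframe as in the answer.
df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)
result = filtered_dummies(df)
print(result)
```

With the toy dataframe this should give the columns COL1, COL3_Y, COL3_Z, COL4_U and COL4_V, with the same 0/1 values as the output above.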
