Python: Combining boolean columns in a Pandas DataFrame

Question

I have the following data:

attr1_A    attr1_B    attr1_C    attr1_D    attr2_A    attr2_B   attr2_C
      1          0          0          1          1          0         0
      0          1          1          0          0          0         1
      0          0          0          0          0          1         0
      1          1          1          0          1          1         0

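I want to keep attr1_A and attr1_B, and combine attr1_C and attr1_D into attr1_others. attr1_others should be 1 whenever attr1_C and/or attr1_D is 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* columns into attr2_others. Like this: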

attr1_A    attr1_B    attr1_others    attr2_A    attr2_others
      1          0          1               1               0     
      0          1          1               0               1  
      0          0          0               0               1 
      1          1          1               1               1

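In other words, for any group of attr, I want to keep a few known columns and combine the rest (I don't know how many other attr columns a group contains).

Processing: I'd like to handle each group separately, attr1_* first and then attr2_*, because my dataset has a limited number of groups but many attr columns under each one.

All I can think of so far is retrieving the others columns, for example: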

# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]

# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]

To combine them, I'm thinking of using the any function, but I can't figure out the syntax. Can you help?

Updated attempt:

I tried this:

# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns) 
                            if "attr1_" in x
                            and "A" not in x 
                            and "B" not in x]].any(axis = 'column')]

but got the following error:

ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>

Answer 1

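DataFrames can manipulate data in place without requiring complex Python logic.

To create your attr1_others and attr2_others columns, you can combine the source columns with an or condition: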

df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']

If you want an and condition instead, you can use:

df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']

You can then use del to remove the leftover original columns:

del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
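
For reference, here is a minimal self-contained sketch of this approach on the question's sample data. It assumes the columns hold integer 0/1 values, in which case | is a bitwise OR and the result stays 0/1. It also uses drop(columns=...) as an alternative to repeated del statements; that substitution is mine, not part of the answer above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})

# Element-wise OR of the columns being merged
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']

# Drop all merged source columns in one call (alternative to del)
df = df.drop(columns=['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])

print(df)
```

Note that this requires naming every merged column explicitly, so it fits best when the set of columns per group is known in advance.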

Answer 2

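Create a list of the columns to keep. Drop those columns and assign the remaining columns to a new DataFrame, df1. Group df1 by the column-name prefix (the part before '_'), call any across the columns of each group, add the suffix '_others', and assign the result to df2. Finally, join and sort_index.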

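keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(columns=keep_cols)
# Group the remaining columns by prefix; transposing first avoids
# groupby(axis=1), which newer pandas releases deprecate.
df2 = (df1.T.groupby(df1.columns.str.split('_').str[0])
          .any().T.add_suffix('_others').astype(int))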

Out[512]:
   attr1_others  attr2_others
0             1             0
1             1             1
2             0             1
3             1             1

df_final = df[keep_cols].join(df2).sort_index(axis=1)

Out[514]:
   attr1_A  attr1_B  attr1_others  attr2_A  attr2_others
0        1        0             1        1             0
1        0        1             1        0             1
2        0        0             0        0             1
3        1        1             1        1             1
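
The grouping key in this answer is derived purely from the column names, so it may help to look at it on its own. The short sketch below assumes every remaining column is named <group>_<suffix>, as in the question; the variable name key is only for illustration.

```python
import pandas as pd

# Columns left over after dropping the ones to keep
remaining = pd.Index(['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])

# Split each name on '_' and take the first piece as the group label
key = remaining.str.split('_').str[0]

print(list(key))  # ['attr1', 'attr1', 'attr2', 'attr2']
```

Columns that share a label end up in the same group, which is why the number of attr columns per group does not need to be known in advance.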

Answer 3

You can use a custom list to select the columns, then call .any() with axis=1. To convert the result to integers, use .astype(int).

For example:

import pandas as pd

df = pd.DataFrame({
        'attr1_A': [1, 0, 0, 1],
        'attr1_B': [0, 1, 0, 1],
        'attr1_C': [0, 1, 0, 1],
        'attr1_D': [1, 0, 0, 0],
        'attr2_A': [1, 0, 0, 1],
        'attr2_B': [0, 0, 1, 1],
        'attr2_C': [0, 1, 0, 0]})

cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

print(df)

Prints:

   attr1_A  attr1_B  attr2_A  attr1_others  attr2_others
0        1        0        1             1             0
1        0        1        0             1             1
2        0        0        0             0             1
3        1        1        1             1             1
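
The ValueError in the question's attempt comes from axis = 'column': pandas accepts axis=1 or axis='columns' (plural). The attempt also wraps the result in an extra df[...], which is not needed. Building on the approach above, the per-group steps could be wrapped in a small helper. This is only a sketch: combine_others and its parameters are hypothetical names, and it assumes every column is named <prefix>_<suffix>.

```python
import pandas as pd

def combine_others(df, prefix, keep):
    """Collapse every `<prefix>_*` column whose suffix is not in `keep`
    into a single 0/1 column named `<prefix>_others`."""
    cols = [c for c in df.columns
            if c.startswith(prefix + '_') and c.split('_', 1)[1] not in keep]
    out = df.drop(columns=cols)
    # 'columns' (plural) is the valid axis name; axis=1 is equivalent
    out[prefix + '_others'] = df[cols].any(axis='columns').astype(int)
    return out

df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})

result = combine_others(df, 'attr1', keep={'A', 'B'})
result = combine_others(result, 'attr2', keep={'A'})
# Sort the columns so each *_others column sits next to its group
result = result.sort_index(axis=1)

print(result)
```

Matching on the suffix after the first '_' rather than testing "A" not in x avoids accidentally excluding columns whose names merely contain the letter A somewhere else.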
