New DataFrame boolean column that checks whether certain columns equal 1

Question

I have the following pd.DataFrame and list of columns:

import pandas as pd

col_list = ["med_a", "med_c"]
df = pd.DataFrame.from_dict({'med_a': [0, 0, 1, 0], 'med_b': [0, 0, 1, 1], 'med_c': [0, 1, 1, 0]})

print(df)
>>>
    med_a   med_b   med_c
0   0       0       0
1   0       0       1
2   1       1       1
3   0       1       0

I want to create a new column (new_col) that holds True/False (or 0/1) for each row, depending on whether any of the columns in col_list is equal to 1. So the result should become:

    med_a   med_b   med_c   new_col
0   0       0       0       0
1   0       0       1       1
2   1       1       1       1
3   0       1       0       0

I know how to select only those rows where at least one of the columns is equal to 1, but that checks every column rather than only those in col_list, and it doesn't create a new column:

print(df[(df == 1).any(axis=1)])
>>>
    med_a   med_b   med_c
1   0       0       1
2   1       1       1
3   0       1       0

How would I achieve the desired result? Any help is appreciated.

Answer 1

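You're very close! Select only the columns in col_list before calling any(axis=1), then convert the boolean result with astype(int). Because the data contains only 0 and 1, any() (which tests for non-zero values) gives the same result here as checking for equality with 1.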

import numpy as np
import pandas as pd

col_list = ["med_a", "med_c"]
df = pd.DataFrame.from_dict({'med_a': [0, 0, 1, 0],
                             'med_b': [0, 0, 1, 1],
                             'med_c': [0, 1, 1, 0]})


df['new_col'] = df[col_list].any(axis=1).astype(int)

print(df)

Or via np.where:

df['new_col'] = np.where(df[col_list].any(axis=1), 1, 0)

   med_a  med_b  med_c  new_col
0      0      0      0        0
1      0      0      1        1
2      1      1      1        1
3      0      1      0        0
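
Both variants above rely on any() treating every non-zero value as True, which matches "equal to 1" only when the columns hold nothing but 0 and 1. If other values can occur, a stricter version is to compare with eq(1) first. The following is a minimal sketch under that assumption; the data (with a 2 in med_a) and the extra column names any_nonzero and new_col_bool are made up for illustration and are not part of the original question.

```python
import pandas as pd

col_list = ["med_a", "med_c"]

# Hypothetical data: med_a contains a 2, which is truthy but not equal to 1.
df = pd.DataFrame({"med_a": [0, 2, 1, 0],
                   "med_b": [0, 0, 1, 1],
                   "med_c": [0, 0, 1, 0]})

# Truthiness check: flags any non-zero value in the selected columns.
df["any_nonzero"] = df[col_list].any(axis=1).astype(int)

# Strict check: flags a row only if a selected column is exactly 1.
df["new_col"] = df[col_list].eq(1).any(axis=1).astype(int)

# Same strict check, kept as True/False instead of 0/1.
df["new_col_bool"] = df[col_list].eq(1).any(axis=1)

print(df)
```

Row 1 is where the two approaches differ: the plain any() flags it because 2 is non-zero, while the eq(1) version does not.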

Timing information via perfplot:

np.where is faster than astype(int) up to about 100,000 rows, at which point the two perform about the same.

import numpy as np
import pandas as pd
import perfplot

np.random.seed(5)
col_list = ["med_a", "med_c"]


def gen_data(n):
    return pd.DataFrame.from_dict({'med_a': np.random.choice([0, 1], size=n),
                                   'med_b': np.random.choice([0, 1], size=n),
                                   'med_c': np.random.choice([0, 1], size=n)})


def np_where(df):
    df['new_col'] = np.where(df[col_list].any(axis=1), 1, 0)
    return df


def astype_int(df):
    df['new_col'] = df[col_list].any(axis=1).astype(int)
    return df


if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            np_where,
            astype_int
        ],
        labels=[
            'np_where',
            'astype_int'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
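
If perfplot is not installed, a rough comparison at a single size can be made with the standard-library timeit module. This is only a sketch: the row count n, the number of repetitions, and the use of np.random.default_rng are arbitrary choices for illustration, and the timings will vary by machine and library version, so no results are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

col_list = ["med_a", "med_c"]
rng = np.random.default_rng(5)
n = 10_000  # assumed size; change it to probe other row counts

# Random 0/1 data with the same column names as the question.
df = pd.DataFrame({name: rng.integers(0, 2, size=n)
                   for name in ["med_a", "med_b", "med_c"]})


def np_where(frame):
    # Returns a NumPy array of 0/1 values.
    return np.where(frame[col_list].any(axis=1), 1, 0)


def astype_int(frame):
    # Returns a pandas Series of 0/1 values.
    return frame[col_list].any(axis=1).astype(int)


# Check that both approaches agree before timing them.
same = np.array_equal(np_where(df), astype_int(df).to_numpy())

# Best of 3 repeats, each repeat timing 20 calls.
t_where = min(timeit.repeat(lambda: np_where(df), number=20, repeat=3))
t_astype = min(timeit.repeat(lambda: astype_int(df), number=20, repeat=3))

print(f"results identical: {same}")
print(f"np.where:    {t_where:.4f} s per 20 calls")
print(f"astype(int): {t_astype:.4f} s per 20 calls")
```

Unlike the perfplot kernels, these functions return the new values instead of assigning them to df, so the measurement excludes the cost of adding a column.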

Answer 1 (score: 2, accepted) 2021-05-23 18:14:40
