New DataFrame boolean column that checks whether or not any of certain columns equal 1
I have the following pd.DataFrame and list of columns:
col_list = ["med_a", "med_c"]
df = pd.DataFrame.from_dict({'med_a': [0, 0, 1, 0], 'med_b': [0, 0, 1, 1], 'med_c': [0, 1, 1, 0]})
print(df)
>>>
   med_a  med_b  med_c
0      0      0      0
1      0      0      1
2      1      1      1
3      0      1      0
I want to make a new column (new_col) that holds True/False (or 0/1) for each row, indicating whether any of the columns in col_list is equal to 1. So the result should become:
   med_a  med_b  med_c  new_col
0      0      0      0        0
1      0      0      1        1
2      1      1      1        1
3      0      1      0        0
I know how to select only those rows where at least one of the columns is equal to 1, but that doesn't restrict the check to the columns in col_list, and it doesn't create a new column:
print(df[(df == 1).any(axis=1)])
>>>
   med_a  med_b  med_c
1      0      0      1
2      1      1      1
3      0      1      0
How would I achieve the desired result? Any help is appreciated.
You're so close! Just filter the df with col_list before calling any on axis=1, then convert the result with astype(int).
import numpy as np
import pandas as pd
col_list = ["med_a", "med_c"]
df = pd.DataFrame.from_dict({'med_a': [0, 0, 1, 0],
                             'med_b': [0, 0, 1, 1],
                             'med_c': [0, 1, 1, 0]})
df['new_col'] = df[col_list].any(axis=1).astype(int)
print(df)
Or via np.where:
df['new_col'] = np.where(df[col_list].any(axis=1), 1, 0)
   med_a  med_b  med_c  new_col
0      0      0      0        0
1      0      0      1        1
2      1      1      1        1
3      0      1      0        0
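One caveat worth noting: any(axis=1) tests truthiness, so every non-zero value counts, not only 1. For 0/1 data like the question's that makes no difference, but if the columns might hold other values, an explicit comparison with eq(1) keeps the check to "equal to 1". A minimal sketch, assuming a variant of the question's data where med_c contains a 2 (an illustrative value I added), and using new_col_bool as a hypothetical name for the True/False version:

```python
import pandas as pd

col_list = ["med_a", "med_c"]
# Same shape as the question's data, but with a 2 in med_c (illustrative assumption)
df = pd.DataFrame({'med_a': [0, 0, 1, 0],
                   'med_b': [0, 0, 1, 1],
                   'med_c': [0, 2, 1, 0]})

# Truthiness check: any non-zero value in col_list counts
truthy = df[col_list].any(axis=1)

# Explicit check: only values equal to 1 in col_list count
is_one = df[col_list].eq(1).any(axis=1)

df['new_col_bool'] = is_one          # True/False version
df['new_col'] = is_one.astype(int)   # 0/1 version
print(df)
```

Here row 1 is flagged by the truthiness check but not by the explicit comparison, since its only non-zero value in col_list is 2.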
Timing information via perfplot: np.where is faster than astype(int) up to about 100,000 rows, at which point they are about the same.
import numpy as np
import pandas as pd
import perfplot

np.random.seed(5)
col_list = ["med_a", "med_c"]


def gen_data(n):
    return pd.DataFrame.from_dict({'med_a': np.random.choice([0, 1], size=n),
                                   'med_b': np.random.choice([0, 1], size=n),
                                   'med_c': np.random.choice([0, 1], size=n)})


def np_where(df):
    df['new_col'] = np.where(df[col_list].any(axis=1), 1, 0)
    return df


def astype_int(df):
    df['new_col'] = df[col_list].any(axis=1).astype(int)
    return df


if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            np_where,
            astype_int
        ],
        labels=[
            'np_where',
            'astype_int'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
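If perfplot is not installed, a rough comparison of the two approaches can be made with the standard-library timeit module. This is only a sketch at a single, arbitrarily chosen size (n = 10,000 is my assumption, as is the repeat count of 20), not a reproduction of the benchmark; timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

col_list = ["med_a", "med_c"]
rng = np.random.default_rng(5)
n = 10_000  # illustrative size, chosen arbitrarily

# Random 0/1 data with the same three columns as the question
df = pd.DataFrame(rng.integers(0, 2, size=(n, 3)),
                  columns=['med_a', 'med_b', 'med_c'])


def np_where(df):
    return np.where(df[col_list].any(axis=1), 1, 0)


def astype_int(df):
    return df[col_list].any(axis=1).astype(int)


# Sanity check: both approaches should produce the same values
same = np.array_equal(np_where(df), astype_int(df).to_numpy())

# Total time for 20 calls of each approach
t_where = timeit.timeit(lambda: np_where(df), number=20)
t_astype = timeit.timeit(lambda: astype_int(df), number=20)
print(f"identical results: {same}")
print(f"np.where: {t_where:.4f}s, astype(int): {t_astype:.4f}s")
```

Unlike the kernels in the perfplot script, these helper functions return the result instead of assigning a column, so the timing reflects only the computation itself.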