Pandas - check if a value exists in multiple columns for each row
I have the following Pandas dataframe:
Index  Name  ID1  ID2  ID3
1      A     Y    Y    Y
2      B     Y    Y
3      B     Y
4      C     Y
I wish to add a new column 'Multiple' to flag the rows that have the value Y in more than one of the columns ID1, ID2, and ID3.
Index  Name  ID1  ID2  ID3  Multiple
1      A     Y    Y    Y    Y
2      B     Y    Y         Y
3      B     Y              N
4      C     Y              N
I'd normally use np.where or np.select, e.g.:
df['multiple'] = np.where(<more than one of ID1, ID2 or ID3 has a Y>, 'Y', 'N')
but I can't figure out how to write the conditional. There might be a growing number of ID columns, so I couldn't cover every combination as a separate condition (e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y)). I think I perhaps want something which counts the Y values across named columns?
Outside of Pandas I would think about working with a list, appending the value of each column where it is Y, and then checking whether the list has a length greater than 1.
But I can't think how to do it within the limitations of np.where, np.select or df.loc. Any pointers?
Using numpy to sum the occurrences of Y by row should do it:
df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]
Output:
Name ID1 ID2 ID3 multi
Index
1 A Y Y Y Y
2 B Y Y None Y
3 B Y None None N
4 C Y None None N
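The one-liner above compares every column, including Name, against 'Y'. As a hedged variation, the sketch below restricts the count to the ID columns and uses np.where as the question suggested. The sample frame and the rule "column name starts with ID" are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; missing entries are None.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", "Y"],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, None],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Assumption: the columns to check are those whose name starts with "ID".
id_cols = [c for c in df.columns if c.startswith("ID")]

# Boolean mask of "Y" cells, summed along each row to get a per-row count.
y_count = df[id_cols].eq("Y").to_numpy().sum(axis=1)

# np.where maps the row-wise condition to the 'Y' / 'N' labels.
df["multi"] = np.where(y_count > 1, "Y", "N")
print(df)
```

Because the column list is built from the frame itself, adding an ID4 column later would not require changing the condition.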
I would do it like this:
Get a list of the columns you want to check.
cols = [x for x in testdf.columns if "ID" in x]
You can use the filter method on DataFrame for this if you want, but I think explicitly selecting the list of columns is clearer, and you have full flexibility to change your conditions later.
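For comparison, here is a minimal sketch of the filter alternative mentioned above. DataFrame.filter selects by label: like= keeps labels containing a substring, and regex= keeps labels matched by a regular expression. The sample frame and the pattern ^ID\d+$ are assumptions chosen to fit the question's column names.

```python
import pandas as pd

# Small frame with the question's column layout (values are illustrative).
testdf = pd.DataFrame(
    {
        "Name": ["A", "B"],
        "ID1": ["Y", "Y"],
        "ID2": ["Y", None],
        "ID3": ["Y", None],
    }
)

# like= keeps every column whose label contains the substring "ID".
id_like = testdf.filter(like="ID")

# regex= is stricter: only labels of the form ID followed by digits.
id_regex = testdf.filter(regex=r"^ID\d+$")

# The resulting labels can be reused as the explicit column list.
cols = list(id_regex.columns)
print(cols)
```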
After that, it's just:
testdf["multiple"] = (testdf[cols]=="Y").sum(axis="columns") > 1
Explanation:
testdf[cols] returns a DataFrame consisting of just the columns you selected in the first line.
testdf[cols]=="Y" returns a DataFrame populated with True or False as per the condition "==Y".
If you really want, you can change the True values to "Y" and the False values to "N".
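Putting these steps together, the following is a small sketch under stated assumptions rather than the answerer's exact code. Since the question asks for more than one Y per row, it counts the True values in each row and compares the count with 1, then maps the booleans to "Y" and "N". The sample data and the hard-coded column list are illustrative.

```python
import pandas as pd

testdf = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", "Y"],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, None],
    }
)

# Assumption: the columns to check are listed explicitly.
cols = ["ID1", "ID2", "ID3"]

# Step 1: boolean DataFrame, True wherever a cell equals "Y".
is_y = testdf[cols] == "Y"

# Step 2: True counts as 1 when summed, so this is the number of Ys per row.
more_than_one = is_y.sum(axis="columns") > 1

# Step 3 (optional): turn the booleans into the "Y" / "N" labels.
testdf["multiple"] = more_than_one.map({True: "Y", False: "N"})
print(testdf)
```

Keeping the intermediate boolean Series around is often handy, since it can also be used directly for row selection, for example testdf[more_than_one].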