Pandas - check whether a value exists in multiple columns of each row

Question

I have the following Pandas dataframe:

Index  Name  ID1  ID2  ID3
    1  A     Y    Y    Y
    2  B     Y    Y        
    3  B     Y              
    4  C               Y

I wish to add a new column 'Multiple' to flag the rows that have the value Y in more than one of the columns ID1, ID2, and ID3.

Index  Name  ID1  ID2  ID3 Multiple
    1  A     Y    Y    Y   Y
    2  B     Y    Y        Y
    3  B     Y             N
    4  C               Y   N

I'd normally use np.where or np.select, e.g.:

df['multiple'] = np.where(<more than 1 of ID1, ID2 or ID3 contain a Y>, 'Y', 'N')

but I can't figure out how to write the conditional. The number of ID columns might grow, so I can't cover every combination as a separate condition, e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y). I think I want something that counts the Y values across named columns?

Outside of Pandas I would work with a list, appending the value of each column where it is Y, and then check whether the list has a length greater than 1.

But I can't think how to do it within the limitations of np.where, np.select or df.loc. Any pointers?

Answer 1

Using numpy to sum the occurrences of Y in each row should do it:

df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]

Output:

      Name ID1   ID2   ID3 multi
Index                           
1        A   Y     Y     Y     Y
2        B   Y     Y  None     Y
3        B   Y  None  None     N
4        C   Y  None  None     N
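
The one-liner above compares every column of the frame, so a Y in a non-ID column such as Name would also be counted. A minimal sketch that restricts the count to the ID columns and uses np.where, as the question asked, might look like the following. The sample frame is rebuilt from the question, representing empty cells as None is an assumption, and the names id_cols and y_count are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question; empty cells are represented as None (an assumption).
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", None],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, "Y"],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Only these columns are compared, so a 'Y' elsewhere (e.g. in Name) is ignored.
id_cols = ["ID1", "ID2", "ID3"]

# Count the 'Y' values in each row, then flag rows with more than one.
y_count = (df[id_cols] == "Y").sum(axis=1)
df["Multiple"] = np.where(y_count > 1, "Y", "N")

print(df)
```

If more ID columns are added later, only the id_cols list needs to change.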

Answer 2

I would do it like this:

Get a list of the columns you want to check.

    cols = [x for x in testdf.columns if "ID" in x]

You can use the filter method on DataFrame for this if you want, but I think explicitly selecting the list of columns is clearer, and it leaves you full flexibility to change your conditions later.

After that, it's just:

    testdf["multiple"] = (testdf[cols]=="Y").sum(axis="columns") > 1

Explanation:

testdf[cols] returns a DataFrame consisting of just the columns you selected in the first line.
testdf[cols]=="Y" returns a DataFrame populated with True or False according to the condition "==Y".
().sum(axis="columns") adds up the True values across the columns of this DataFrame, giving the number of Y values in each row. Comparing that count with > 1 returns True for rows with more than one Y, and False otherwise.

If you really want, you can change the True values to "Y" and the False values to "N".
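
As a rough sketch of the two optional steps mentioned in this answer, the following selects the ID columns with DataFrame.filter and converts the boolean result to "Y"/"N" with np.where. Because the question asks for more than one Y per row, the sketch counts matches per row and compares the count with 1. The sample frame is rebuilt from the question with empty cells as None, and the regex pattern and the names id_df and is_multiple are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; empty cells are represented as None (an assumption).
testdf = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", None],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, "Y"],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# DataFrame.filter(regex=...) keeps the columns whose labels match the pattern,
# here "ID" followed by digits.
id_df = testdf.filter(regex=r"^ID\d+$")

# True for rows that contain 'Y' in more than one of the selected columns.
is_multiple = (id_df == "Y").sum(axis="columns") > 1

# Map True/False to "Y"/"N".
testdf["multiple"] = np.where(is_multiple, "Y", "N")

print(testdf)
```

The regex keeps the selection in step with any ID columns added later, at the cost of being less explicit than a hand-written list.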

Answer 1
1  Accepted  2019-06-24 15:09:46

Answer 2
0  2019-06-24 15:35:03
