Pandas - check if a value exists in multiple columns for each row

I have the following Pandas dataframe:

Index  Name  ID1  ID2  ID3
    1  A     Y    Y    Y
    2  B     Y    Y        
    3  B     Y              
    4  C               Y

I wish to add a new column 'Multiple' to indicate those rows where there is a value Y in more than one of the columns ID1, ID2, and ID3.

Index  Name  ID1  ID2  ID3 Multiple
    1  A     Y    Y    Y   Y
    2  B     Y    Y        Y
    3  B     Y             N
    4  C               Y   N

I'd normally use np.where or np.select, e.g.:

df['multiple'] = np.where(<more than 1 of ID1, ID2 or ID3 have a Y in>, 'Y', 'N')

but I can't figure out how to write the conditional. There might be a growing number of ID columns, so I couldn't cover every combination as a separate condition (e.g. (ID1 = Y and ID3 = Y) or (ID2 = Y and ID3 = Y)). I think I perhaps want something which counts the Y values across named columns?

Outside of Pandas, I would think about working with a list: for each row, append the value of each column where it is Y, and then check whether the list has a length greater than 1.

But I can't think how to do it within the limitations of np.where, np.select or df.loc. Any pointers?

Using numpy to sum the occurrences of Y in each row should do it:

df['multi'] = ['Y' if x > 1 else 'N' for x in np.sum(df.values == 'Y', 1)]

Output:

      Name ID1   ID2   ID3 multi
Index                           
1        A   Y     Y     Y     Y
2        B   Y     Y  None     Y
3        B   Y  None  None     N
4        C   Y  None  None     N
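One caveat: `df.values == 'Y'` compares every column, so a `Name` that happened to be `'Y'` would be counted too. Below is a minimal sketch that restricts the count to the ID columns and feeds the result into `np.where`, as the question intended. The way the frame is constructed and the names `id_cols` and `y_count` are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's data; blank cells are None.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", None],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, "Y"],
    },
    index=pd.Index([1, 2, 3, 4], name="Index"),
)

# Only look at the ID columns, so values in Name are never counted.
id_cols = ["ID1", "ID2", "ID3"]

# Comparing to "Y" gives a boolean frame; summing along axis=1 counts
# the True values in each row.
y_count = (df[id_cols] == "Y").sum(axis=1)

# More than one "Y" in a row maps to "Y", anything else to "N".
df["Multiple"] = np.where(y_count > 1, "Y", "N")
```

If more ID columns appear later, only `id_cols` needs to change.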

I would do it like this:

Get a list of the columns you want to check.

    cols = [x for x in testdf.columns if x.startswith("ID")]

If you prefer, you can use the filter method on DataFrame for this, but I think explicitly selecting the list of columns is clearer, and you have full flexibility to change your conditions later.

After that, it's just:

    testdf["multiple"] = (testdf[cols]=="Y").sum(axis="columns") > 1

Explanation:

  • testdf[cols] returns a DataFrame consisting of just the columns you selected in the first line.
  • testdf[cols]=="Y" returns a DataFrame populated with True or False as per the condition "==Y".
  • ().sum(axis="columns") counts the True values across the columns of this DataFrame for each row, and > 1 then returns True if more than one item in the row is True, and False otherwise.

If you really want, you can change the True values to "Y" and the False values to "N".
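The filter alternative and the "Y"/"N" conversion mentioned above can be sketched together as follows. This is only an illustration under assumptions: the sample data is reconstructed from the question, the regex `^ID` assumes every column to check starts with "ID", and `id_frame` and `mask` are made-up names. The mask tests for more than one "Y" per row, which is what the question asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's data; blank cells are None.
testdf = pd.DataFrame(
    {
        "Name": ["A", "B", "B", "C"],
        "ID1": ["Y", "Y", "Y", None],
        "ID2": ["Y", "Y", None, None],
        "ID3": ["Y", None, None, "Y"],
    }
)

# DataFrame.filter with a regex keeps the columns whose names start with "ID".
id_frame = testdf.filter(regex=r"^ID")

# True where more than one of the ID columns holds "Y" in that row.
mask = (id_frame == "Y").sum(axis="columns") > 1

# Convert the boolean mask to "Y"/"N" labels.
testdf["multiple"] = np.where(mask, "Y", "N")
```

Keeping the boolean column is usually more convenient for later filtering. The string labels are only needed if the output has to match the "Y"/"N" format shown in the question.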
