
Selecting Rows in Dataframe that have any column equal to any item in a list

Let's say I have the following dataframe and I want to select any row that has any of its values equal to any item in the list: CodesOfInterest=['A','D']

>>> import numpy as np
>>> import pandas as pd
>>> d1=pd.DataFrame([['A','B','C','D'],['D','Q','S', np.nan],['R',np.nan,np.nan,np.nan],[np.nan,'A',np.nan,np.nan]],columns=['Code1','Code2','Code3','Code4'])
>>> d1
  Code1 Code2 Code3 Code4
0     A     B     C     D
1     D     Q     S   NaN
2     R   NaN   NaN   NaN
3   NaN     A   NaN   NaN
>>> 

This can be done pretty easily with one line of code:

>>> CodesOfInterest=['A','D']
>>> d1[d1.isin(CodesOfInterest).any(axis=1)]
  Code1 Code2 Code3 Code4
0     A     B     C     D
1     D     Q     S   NaN
3   NaN     A   NaN   NaN
>>> 

However, say I have the following second dataframe, indexed the same as the first, that adds a condition to this subset.

>>> d2=pd.DataFrame([[1,0,1,0],[0,1,1, np.nan],[1,np.nan,np.nan,np.nan],[np.nan,1,np.nan,np.nan]],columns=['CodeStatus1','CodeStatus2','CodeStatus3','CodeStatus4'])
>>> d2
   CodeStatus1  CodeStatus2  CodeStatus3  CodeStatus4
0            1            0            1            0
1            0            1            1          NaN
2            1          NaN          NaN          NaN
3          NaN            1          NaN          NaN
>>> 

Now I want to select only the rows from d1 that have any of their values equal to any item in my list AND have the corresponding 'CodeStatus' (from d2) equal to 1. By corresponding CodeStatus I mean the pairs (Code1, CodeStatus1), (Code2, CodeStatus2), etc.

I have a clunky way of doing this that requires looping through each of the 4 Codes and Code Statuses. See below:

>>> bs=[]
>>> for Num in range(1,5):
...     Code='Code'+str(Num)
...     CodeStatus='CodeStatus'+str(Num)
...     b=(d1[Code].isin(CodesOfInterest))&(d2[CodeStatus]==1)
...     bs.append(b)
... 
>>> Matches=pd.concat(bs,axis=1)
>>> 
>>> d1[Matches.any(axis=1)]
  Code1 Code2 Code3 Code4
0     A     B     C     D
3   NaN     A   NaN   NaN
>>> 

As you can see, record 1 now gets dropped from the dataframe because, although it has a column with code 'D', the Code Status for this code is not equal to 1.
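
To make the reason record 1 is dropped concrete, here is a small sketch that looks at that one row cell by cell. The names `codes_of_interest`, `code_hit`, `status_on` and `both` are only illustrative and are not part of the original session; the two frames are rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

codes_of_interest = ['A', 'D']
d1 = pd.DataFrame(
    [['A', 'B', 'C', 'D'],
     ['D', 'Q', 'S', np.nan],
     ['R', np.nan, np.nan, np.nan],
     [np.nan, 'A', np.nan, np.nan]],
    columns=['Code1', 'Code2', 'Code3', 'Code4'])
d2 = pd.DataFrame(
    [[1, 0, 1, 0],
     [0, 1, 1, np.nan],
     [1, np.nan, np.nan, np.nan],
     [np.nan, 1, np.nan, np.nan]],
    columns=['CodeStatus1', 'CodeStatus2', 'CodeStatus3', 'CodeStatus4'])

row = 1
# Which positions in row 1 hold a code of interest (only Code1 == 'D').
code_hit = d1.loc[row].isin(codes_of_interest).to_numpy()
# Which positions in row 1 have a status of 1 (CodeStatus2 and CodeStatus3).
status_on = (d2.loc[row] == 1).to_numpy()
# A position counts only if both conditions hold at the same position.
both = code_hit & status_on

print(code_hit)   # the code matches in position 0 only
print(status_on)  # the status is 1 in positions 1 and 2 only
print(both.any()) # no position satisfies both, so the row is excluded
```

The matching code and the active statuses never fall in the same (Code, CodeStatus) pair for this row, which is why it disappears from the result.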

Is there a more elegant way to make this query that doesn't require looping through each column?

You can achieve this as follows:

d1[(d1.isin(CodesOfInterest).values & (d2==1).values).any(axis=1)]
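
For reference, here is a self-contained sketch of the same idea. It assumes both frames have the same shape and that their rows and columns are in matching positional order, as in the question. The helper name `rows_with_active_code` and the variables `codes_of_interest` and `d2_aligned` are made up for this illustration. The masks are converted to plain NumPy arrays because the column labels differ (`Code1` vs `CodeStatus1`), so combining the two DataFrames directly would align on labels and pair nothing up.

```python
import numpy as np
import pandas as pd

codes_of_interest = ['A', 'D']
d1 = pd.DataFrame(
    [['A', 'B', 'C', 'D'],
     ['D', 'Q', 'S', np.nan],
     ['R', np.nan, np.nan, np.nan],
     [np.nan, 'A', np.nan, np.nan]],
    columns=['Code1', 'Code2', 'Code3', 'Code4'])
d2 = pd.DataFrame(
    [[1, 0, 1, 0],
     [0, 1, 1, np.nan],
     [1, np.nan, np.nan, np.nan],
     [np.nan, 1, np.nan, np.nan]],
    columns=['CodeStatus1', 'CodeStatus2', 'CodeStatus3', 'CodeStatus4'])


def rows_with_active_code(codes, statuses, wanted):
    """Hypothetical helper: keep rows of `codes` where some cell is in
    `wanted` and the cell at the same position in `statuses` equals 1."""
    code_hit = codes.isin(wanted).to_numpy()     # positional boolean mask
    status_on = statuses.eq(1).to_numpy()        # NaN compares as False
    return codes[(code_hit & status_on).any(axis=1)]


result = rows_with_active_code(d1, d2, codes_of_interest)
print(result)

# Alternative that stays in pandas: give the status frame the same column
# labels as d1 so that label alignment pairs Code1 with CodeStatus1, etc.
d2_aligned = d2.copy()
d2_aligned.columns = d1.columns
result_aligned = d1[(d1.isin(codes_of_interest) & d2_aligned.eq(1)).any(axis=1)]
```

The NumPy version ignores index labels entirely, so it is only appropriate when the two frames really are row-for-row counterparts; the relabelled variant also aligns on the row index, which may be safer if the frames could be reordered.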
