
pandas select rows by matching a column entry to entries in multiple other columns

In Python 2.7, I create a pandas DataFrame of the following form:

import pandas as pd

df = pd.DataFrame({
'ID' : ['1','2','3'],
'sps1' : ['1001', '1111', '1000'],
'sps2' : ['1001','0001','NaN'],
'sps3' : ['1001','NaN','1000'],
'sps4' : ['1001','1101','0101']
})

Thus it looks like:

     ID  sps1  sps2  sps3  sps4
0     1  1001  1001  1001  1001
1     2  1111  0001   NaN  1101
2     3  1000   NaN  1000  0101

Each row contains data on a different biological sequence, which possesses a unique ID (1, 2, 3 etc). Each sequence is present in 4 different species (sps1-4). The presence (1) or absence (0) of 4 different features in each sequence is encoded as a 4-digit code. The sequence is missing from some species, thus NaN is recorded.

From this dataframe, I want to select rows where the sps1 code does not match the code of any other species.

So in the example above, I want to discard row 0 (code 1001 is the same for all species) and row 2 (the sps1 code 1000 matches that of sps3), but keep row 1 (the sps1 code 1111 is unique).

Ultimately I want to put these selected rows in a new dataframe with the same structure.

I am new to using pandas. So far I managed to find a way to do it like this:

matches = df.loc[( (df['sps1'] != df['sps2']) & (df['sps1'] != df['sps3']) )].index
df_match = df.iloc[matches]

I could continue this style for all combinations of sps1 and spsX, but in my full analysis I will be handling upwards of 12 species, so this is a lot of typing and not very efficient. I guess there must be a cleaner way?

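You can use filter to select columns by name pattern and eq to compare the sps1 column with each of them; axis="rows" aligns sps1 on the row index, so every selected column is compared with it element by element. Summing the resulting booleans across each row counts the matches. Because sps1 always matches itself, a count of exactly 1 means no other species shares its code, which gives a boolean vector you can use for subsetting: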

df[(df.filter(regex = "^sps").eq(df.sps1, axis="rows")).sum(axis=1) == 1]

#  ID   sps1    sps2    sps3    sps4
#1  2   1111    0001     NaN    1101
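To see why the == 1 test works, it can help to look at the intermediate boolean frame. The sketch below rebuilds the example frame from the question; the variable names sps_cols, same_as_sps1, n_matches and df_unique are illustrative and do not come from the answer, and axis="index" is used as the equivalent of axis="rows".

```python
import pandas as pd

# Example frame from the question; 'NaN' is a literal string here.
df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101'],
})

# Keep only the species columns (names starting with "sps").
sps_cols = df.filter(regex=r"^sps")

# Compare every species column with sps1, aligning on the row index.
same_as_sps1 = sps_cols.eq(df['sps1'], axis="index")

# Count the matches per row; sps1 always matches itself, so the minimum is 1.
n_matches = same_as_sps1.sum(axis=1)

# A count of exactly 1 means no other species shares the sps1 code.
df_unique = df[n_matches == 1]

print(same_as_sps1)
print(df_unique)
```

Printing same_as_sps1 before subsetting is a cheap way to confirm which species triggered a match for each row.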

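Psidom has already given you a great answer, but to piggyback on it slightly: you can leave out the column you are comparing against and then use any(), which avoids having to sum each row.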

df[~df.filter(regex=r"^sps(?!1$)\d+$").eq(df.sps1, axis='rows').any(axis=1)]
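Since the full analysis will involve twelve or more species, the same idea could be wrapped in a small helper. This is only a sketch: the function name rows_where_ref_is_unique and its parameters ref and prefix are hypothetical, and it assumes every species column name starts with the prefix "sps".

```python
import pandas as pd


def rows_where_ref_is_unique(df, ref='sps1', prefix='sps'):
    """Return the rows whose `ref` code matches no other `prefix` column."""
    # All species columns except the reference column itself.
    others = [c for c in df.columns if c.startswith(prefix) and c != ref]
    # True for rows where at least one other species has the same code as `ref`.
    has_match = df[others].eq(df[ref], axis='index').any(axis=1)
    # Keep the rows without any match; the column layout is unchanged.
    return df[~has_match]


# Example frame from the question.
df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101'],
})

df_match = rows_where_ref_is_unique(df)
print(df_match)
```

Building the column list with a comprehension sidesteps the negative-lookahead regex, and the returned frame keeps the same structure as the input.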

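You guessed correctly, there is a cleaner way. The following checks each row for any repeated code among the species columns, which is stricter than comparing against sps1 only (a repeat among sps2 to sps4, or two 'NaN' entries, also excludes the row):

df.loc[[df.iloc[i, 1:].duplicated().sum() == 0 for i in range(len(df))]]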

Result:

  ID  sps1  sps2 sps3  sps4
1  2  1111  0001  NaN  1101
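The difference between a row-wise duplicated() check and a direct comparison with sps1 shows up when two other species share a code. The sketch below uses a made-up frame, df4, whose second row (ID 4) is not part of the question's data: sps2 and sps3 share a code while sps1 stays unique. The names df4, no_dups, others and sps1_unique are illustrative.

```python
import pandas as pd

# Hypothetical data: row ID 4 repeats a code in sps2 and sps3 only.
df4 = pd.DataFrame({
    'ID': ['2', '4'],
    'sps1': ['1111', '1111'],
    'sps2': ['0001', '0001'],
    'sps3': ['NaN', '0001'],
    'sps4': ['1101', '1101'],
})

# Row-wise duplicate check over the species columns (positions 1 onward).
no_dups = [not df4.iloc[i, 1:].duplicated().any() for i in range(len(df4))]

# Direct comparison of sps1 with the remaining species columns.
others = df4.columns[2:]
sps1_unique = ~df4[others].eq(df4['sps1'], axis='index').any(axis=1)

print(df4.loc[no_dups])    # keeps only ID 2
print(df4[sps1_unique])    # keeps ID 2 and ID 4
```

If only the sps1 code matters, the direct comparison is the safer choice; the duplicated() form suits the case where every code in the row must be distinct.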
