In Python 2.7, I create a pandas DataFrame of the following form:
import pandas as pd
df = pd.DataFrame({
'ID' : ['1','2','3'],
'sps1' : ['1001', '1111', '1000'],
'sps2' : ['1001','0001','NaN'],
'sps3' : ['1001','NaN','1000'],
'sps4' : ['1001','1101','0101']
})
Thus it looks like:
  ID  sps1  sps2  sps3  sps4
0  1  1001  1001  1001  1001
1  2  1111  0001   NaN  1101
2  3  1000   NaN  1000  0101
Each row contains data on a different biological sequence, which possesses a unique ID (1, 2, 3 etc). Each sequence is present in 4 different species (sps1-4). The presence (1) or absence (0) of 4 different features in each sequence is encoded as a 4-digit code. The sequence is missing from some species, thus NaN is recorded.
From this dataframe, I want to select rows where the code for sps1 does not match the code for any other species.
So in the example above, I want to discard row 0 (code 1001 is the same for all species) and row 2 (sps1 code 1000 matches that of sps3), but keep row 1 (sps1 code 1111 is unique).
Ultimately I want to put these selected rows in a new dataframe with the same structure.
I am new to using pandas. So far I managed to find a way to do it like this:
matches = df.loc[( (df['sps1'] != df['sps2']) & (df['sps1'] != df['sps3']) )].index
df_match = df.iloc[matches]
I could continue this style for all combinations of sps1 and spsX, but in my full analysis I will be handling upwards of 12 species, so this is a lot of typing and not very efficient. I guess there must be a cleaner way?
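You can use filter to select columns by pattern, and eq to compare every selected column with the sps1 column; specify axis="rows" so that sps1 is aligned on the row index and compared against each column in turn. Since sps1 always equals itself, a row-wise sum of 1 means that no other species shares its code. This produces a boolean vector which you can use for subsetting: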
df[(df.filter(regex = "^sps").eq(df.sps1, axis="rows")).sum(axis=1) == 1]
# ID sps1 sps2 sps3 sps4
#1 2 1111 0001 NaN 1101
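An equivalent way to build the same mask, as a minimal sketch: drop sps1 from the filtered columns and require every remaining column to differ from it. The names others, mask and df_unique are introduced here for illustration only and do not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ID':   ['1', '2', '3'],
    'sps1': ['1001', '1111', '1000'],
    'sps2': ['1001', '0001', 'NaN'],
    'sps3': ['1001', 'NaN', '1000'],
    'sps4': ['1001', '1101', '0101'],
})

# All species columns except the reference column sps1.
others = df.filter(regex='^sps').drop('sps1', axis=1)

# ne() with axis=0 aligns df['sps1'] on the row index, so each
# remaining column is compared element-wise against sps1.
mask = others.ne(df['sps1'], axis=0).all(axis=1)

# Keep only rows where sps1 differs from every other species.
df_unique = df[mask]
print(df_unique)
```

Because the column selection is driven by the regex, the same lines should work unchanged for 12 or more species columns, provided they all start with "sps".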
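You have guessed correctly, there is a cleaner way. This version keeps only the rows in which no code is repeated across the species columns: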
df.loc[[df.iloc[i,1:].duplicated().sum() == 0 for i in df.index]]
Result:
  ID  sps1  sps2  sps3  sps4
1  2  1111  0001   NaN  1101
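One caveat: duplicated() flags any repeated code within a row, so a row where two other species share a code is dropped even though sps1 is unique. The sketch below illustrates this with a hypothetical fourth row (ID 4) that is not in the question's data, and compares only against sps1 row by row. The helper sps1_is_unique and the variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID':   ['1', '2', '3', '4'],
    'sps1': ['1001', '1111', '1000', '1111'],
    'sps2': ['1001', '0001', 'NaN', '0001'],
    'sps3': ['1001', 'NaN', '1000', '0001'],  # row 4: sps2 == sps3
    'sps4': ['1001', '1101', '0101', '1101'],
})

def sps1_is_unique(row):
    # True when no other species column carries the same code as sps1.
    return not (row.drop('sps1') == row['sps1']).any()

# Row-wise test against sps1 only.
keep = df.filter(regex='^sps').apply(sps1_is_unique, axis=1)
by_sps1 = df[keep]

# The duplicated()-based test from the answer above, for comparison.
no_dups = [df.iloc[i, 1:].duplicated().sum() == 0 for i in range(len(df))]
by_duplicated = df.loc[no_dups]

print(by_sps1['ID'].tolist())        # rows kept when comparing to sps1 only
print(by_duplicated['ID'].tolist())  # rows kept when any duplicate disqualifies
```

With this data the sps1-only test keeps IDs 2 and 4, while the duplicated() test keeps only ID 2. Which behaviour is wanted depends on whether codes shared among the other species should matter.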