pandas idxmax: return all rows in case of ties

I am working with a dataframe in which I have weighted each row by its probability. I want to select the row with the highest probability, and I am using pandas idxmax() to do so. However, when there are ties, it returns only the first of the tied rows. In my case, I want to get all the rows that tie.

Furthermore, I am doing this as part of a research project where I am processing millions of dataframes like the one below, so keeping it fast matters.

Example:

My data looks like this:

data = [['chr1',100,200,0.2],
    ['ch1',300,500,0.3],
    ['chr1', 300, 500, 0.3],
    ['chr1', 600, 800, 0.3]]

From this list, I create a pandas dataframe as follows:

import pandas as pd
weighted = pd.DataFrame.from_records(data, columns=['chrom', 'start', 'end', 'probability'])

Which looks like this:

  chrom  start  end  probability
0  chr1    100  200          0.2
1   ch1    300  500          0.3
2  chr1    300  500          0.3
3  chr1    600  800          0.3

I then select the row at argmax(probability) using:

selected = weighted.loc[weighted['probability'].idxmax()]  # .loc instead of .ix, which newer pandas versions no longer provide

Which of course returns:

chrom          ch1
start          300
end            500
probability    0.3
Name: 1, dtype: object

Is there a (fast) way to get all the rows when there are ties?

Thanks!

The bottleneck lies in calculating the Boolean indexer. You can bypass the overhead associated with pd.Series objects by performing the comparison on the underlying NumPy array (df here stands for the dataframe called weighted in the question):

df2 = df[df['probability'].values == df['probability'].values.max()]
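For reference, here is a minimal self-contained sketch of this approach applied to the question's data. The helper name rows_with_max is hypothetical and introduced only for illustration, and to_numpy() is used as the newer spelling of .values (on older pandas versions such as the one tested below, .values would be needed instead).

```python
import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])


def rows_with_max(df, column):
    """Return every row whose value in `column` equals the column maximum.

    Hypothetical helper, not part of pandas. Assumes `df` is non-empty.
    """
    values = df[column].to_numpy()  # plain NumPy array, no Series overhead
    return df[values == values.max()]  # Boolean mask keeps all tied rows


tied = rows_with_max(weighted, 'probability')
print(tied)
```

The result keeps the original index labels, so the tied rows can still be traced back to their positions in the input dataframe.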

Performance benchmark against the pandas equivalent:

# tested on Pandas v0.19.2, Python 3.6.0

df = pd.concat([df]*100000, ignore_index=True)

%timeit df['probability'].eq(df['probability'].max())               # 3.78 ms per loop
%timeit df['probability'].values == df['probability'].values.max()  # 416 µs per loop
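The %timeit lines above are IPython magics. Outside IPython, a comparable measurement could be sketched with the standard-library timeit module, as below. The function names pandas_mask and numpy_mask and the repeat count are assumptions made for illustration, and the timings will differ by machine and pandas version, so no numbers are claimed here.

```python
import timeit

import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
df = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])
# Same scaling as the benchmark above: 4 rows repeated 100000 times.
df = pd.concat([df] * 100000, ignore_index=True)


def pandas_mask():
    # Comparison done on the Series itself.
    return df['probability'].eq(df['probability'].max())


def numpy_mask():
    # Comparison done on the underlying NumPy array.
    values = df['probability'].to_numpy()
    return values == values.max()


n = 20  # arbitrary repeat count, chosen only to keep the run short
t_pandas = timeit.timeit(pandas_mask, number=n) / n
t_numpy = timeit.timeit(numpy_mask, number=n) / n
print(f"pandas: {t_pandas * 1e3:.3f} ms per loop")
print(f"numpy:  {t_numpy * 1e3:.3f} ms per loop")
```

Both functions build the same Boolean mask, so whichever is faster on a given setup can be used to index the dataframe.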

Well, this might be the solution you are looking for (the trailing .T only transposes the result for display):

weighted.loc[weighted['probability']==weighted['probability'].max()].T
#               1     2     3
#chrom        ch1  chr1  chr1
#start        300   300   600
#end          500   500   800
#probability  0.3   0.3   0.3
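To relate this back to idxmax(), the following sketch builds the same mask in two steps and compares the labels it selects with the single label idxmax() reports. The variable names is_max, tied and first_label are made up for this example.

```python
import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

# True for every row whose probability equals the column maximum.
is_max = weighted['probability'] == weighted['probability'].max()
tied = weighted.loc[is_max]

# idxmax() reports only the first label among the tied rows.
first_label = weighted['probability'].idxmax()

print(tied)
print(first_label, tied.index.tolist())
```

Without the .T, the tied rows are returned in the same row-per-record layout as the input, which is usually what further processing expects.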
