I am working with a dataframe in which I have weighted each row by its probability. Now I want to select the row with the highest probability, and I am using pandas idxmax() to do so. However, when there are ties, it returns only the first of the tied rows. In my case, I want to get all the rows that tie.
Furthermore, I am doing this as part of a research project where I am processing millions of dataframes like the one below, so keeping it fast matters.
Example:
My data looks like this:
data = [['chr1',100,200,0.2],
['ch1',300,500,0.3],
['chr1', 300, 500, 0.3],
['chr1', 600, 800, 0.3]]
From this list, I create a pandas dataframe as follows:
weighted = pd.DataFrame.from_records(data,columns=['chrom','start','end','probability'])
Which looks like this:
chrom start end probability
0 chr1 100 200 0.2
1 ch1 300 500 0.3
2 chr1 300 500 0.3
3 chr1 600 800 0.3
I then select the row at argmax(probability) using:
selected = weighted.loc[weighted['probability'].idxmax()]
Which of course returns:
chrom ch1
start 300
end 500
probability 0.3
Name: 1, dtype: object
Is there a (fast) way to get all the rows when there are ties?
Thanks!
The bottleneck lies in calculating the Boolean indexer. You can bypass the overhead associated with pd.Series
objects by performing the calculation on the underlying NumPy array (df here stands for the dataframe from the question):
df2 = df[df['probability'].values == df['probability'].values.max()]
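As a minimal, self-contained sketch of the line above applied to the question's data (the names `probs` and `tied` are chosen here purely for illustration):

```python
import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

# Pull out the underlying NumPy array once and reuse it.
probs = weighted['probability'].values

# Boolean mask that is True for every row sharing the maximum probability.
tied = weighted[probs == probs.max()]

print(tied.index.tolist())  # [1, 2, 3]
```

Note that this relies on exact floating-point equality, which holds here because the tied values are identical; probabilities that come out of arithmetic may differ in the last bits and would then not count as ties.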
Performance benchmarking with the Pandas equivalent:
# tested on Pandas v0.19.2, Python 3.6.0
df = pd.concat([df]*100000, ignore_index=True)
%timeit df['probability'].eq(df['probability'].max()) # 3.78 ms per loop
%timeit df['probability'].values == df['probability'].values.max() # 416 µs per loop
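The %timeit lines above only work inside IPython. A rough way to repeat the comparison in a plain script might look like the sketch below. It uses the standard-library timeit module, and the helper names `pandas_mask` and `numpy_mask` are made up for this example. Absolute timings will differ by machine and pandas version, so no numbers are claimed here.

```python
import timeit

import pandas as pd

# Same shape of data as in the question, tiled to 400,000 rows.
df = pd.DataFrame({'probability': [0.2, 0.3, 0.3, 0.3]})
df = pd.concat([df] * 100000, ignore_index=True)


def pandas_mask():
    # Boolean indexer computed with pd.Series methods.
    return df['probability'].eq(df['probability'].max())


def numpy_mask():
    # Boolean indexer computed on the underlying NumPy array.
    values = df['probability'].values
    return values == values.max()


# Sanity check: both approaches should flag exactly the same rows.
same = bool((pandas_mask().values == numpy_mask()).all())

n = 20  # number of repetitions per measurement
t_pandas = timeit.timeit(pandas_mask, number=n) / n
t_numpy = timeit.timeit(numpy_mask, number=n) / n
print(f"pandas: {t_pandas * 1e3:.3f} ms, numpy: {t_numpy * 1e3:.3f} ms per call")
```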
Well, this might be the solution you are looking for:
weighted.loc[weighted['probability']==weighted['probability'].max()].T
# 1 2 3
#chrom ch1 chr1 chr1
#start 300 300 600
#end 500 500 800
#probability 0.3 0.3 0.3