The following table contains some keys and values:
import numpy as np
import pandas as pd
N = 100
tbl = pd.DataFrame({'key': np.random.randint(0, 10, N),
                    'y': np.random.rand(N),
                    'z': np.random.rand(N)})
For each key, I would like to obtain the full row (all fields) in which a specified field takes its minimal value, collected into a single DataFrame.
Since the original table is very large, I'm interested in the most efficient way to do this.
Note that getting the minimal value of a field is simple:
tbl.groupby('key').agg(pd.Series.min)
But this takes the minimum of every field independently. I would like to know the minimum value of y for each key and which z value corresponds to it.
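To make the problem concrete, here is a minimal sketch on a tiny made-up table (the name `demo` and its values are hypothetical, chosen only so that the smallest y and the smallest z for key 0 sit on different rows):

```python
import pandas as pd

# Hypothetical 3-row table: for key 0 the smallest y is on the first row,
# while the smallest z is on the second row.
demo = pd.DataFrame({'key': [0, 0, 1],
                     'y':   [0.2, 0.5, 0.9],
                     'z':   [0.8, 0.1, 0.4]})

# Column-wise minimum per key: y and z are minimised independently of each other.
independent = demo.groupby('key').min()

# For key 0 this pairs the smallest y with the smallest z,
# a combination that does not occur in any single row of demo.
print(independent)
```

The per-key result mixes values from different rows, which is exactly what the question wants to avoid.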
Below I post an answer to my own question with my naive approach, but I suspect there are better ways.
Here is a straightforward approach:
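gr = tbl.groupby('key')
def take_min_y(t):
    # idxmin returns the index label of the row with the smallest y in this group
    # (argmin returns a position in recent pandas, which would not match .loc)
    ix = t.y.idxmin()
    return t.loc[[ix]]
tbl_mins = gr.apply(take_min_y)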
Is there a better way?
Based on your updated edit I believe the following is what you want:
In [107]:
tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
Out[107]:
    key         y         z
47    0  0.094841  0.221435
26    1  0.062200  0.748082
45    2  0.032497  0.160199
28    3  0.002242  0.064829
73    4  0.122438  0.723844
75    5  0.128193  0.638933
79    6  0.071833  0.952624
86    7  0.058974  0.113317
36    8  0.068757  0.611111
12    9  0.082604  0.271268
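idxmin returns the index label of the minimum value in each group; we can then use these labels to select the corresponding rows from the original DataFrame. Because they are labels rather than positions, tbl.loc is the indexer that matches them in general; tbl.iloc gives the same rows here only because tbl has the default integer index, where labels and positions coincide.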
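A self-contained sketch of the same idea using .loc, assuming a table built like the one in the question (the seeded generator `rng` and the names `min_labels` and `tbl_mins` are my own, used only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
N = 100
tbl = pd.DataFrame({'key': rng.integers(0, 10, N),
                    'y': rng.random(N),
                    'z': rng.random(N)})

# For each key, the index label of the row holding the smallest y.
min_labels = tbl.groupby('key')['y'].idxmin()

# idxmin returns labels, so select with .loc; each selected row keeps its own z.
tbl_mins = tbl.loc[min_labels]
print(tbl_mins)
```

The result has one row per key, and the z in each row is the one that belongs to that key's minimal y.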
Timings on the 100-row table show this method is approximately 7 times faster than the apply-based approach:
In [108]:
%timeit tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
def take_min_y(t):
    # idxmin gives the index label expected by .loc
    ix = t.y.idxmin()
    return t.loc[[ix]]
%timeit tbl_mins = gr.apply(take_min_y)
1000 loops, best of 3: 1.08 ms per loop
100 loops, best of 3: 7.06 ms per loop
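%timeit is an IPython magic, so outside IPython the comparison can be repeated with the standard timeit module. The following is only a sketch under my own assumptions: the helper names `via_idxmin` and `via_apply`, the seed, and the repeat count are hypothetical, and the apply variant selects the y and z columns explicitly so the grouping column is not passed to the function. The numbers it prints will depend on the machine, the pandas version, and the table size.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
N = 100
tbl = pd.DataFrame({'key': rng.integers(0, 10, N),
                    'y': rng.random(N),
                    'z': rng.random(N)})
gr = tbl.groupby('key')

def via_idxmin():
    # One idxmin over all groups, followed by a single label-based lookup.
    return tbl.loc[gr['y'].idxmin()]

def via_apply():
    # One Python-level function call per group.
    return gr[['y', 'z']].apply(lambda t: t.loc[[t['y'].idxmin()]])

runs = 100
t_idxmin = timeit.timeit(via_idxmin, number=runs)
t_apply = timeit.timeit(via_apply, number=runs)
print(f"idxmin: {t_idxmin:.4f} s, apply: {t_apply:.4f} s for {runs} runs each")
```

Both helpers select the same rows; the apply variant carries the key in an extra index level instead of a column.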