I am trying to filter a pandas DataFrame based on two columns, so that for each value in column 1, only the rows where column 2 equals that group's minimum are kept. That may sound confusing, so here is an example:
> df = pd.DataFrame([{'a':'anno1', 'ppm':1},{'a':'anno1', 'ppm':2},{'a':'anno2', 'ppm':2},{'a':'anno2', 'ppm':2}])
> df
a ppm
0 anno1 1
1 anno1 2
2 anno2 2
3 anno2 2
I want rows 0, 2 and 3, because for anno1 the minimum ppm is 1, and for anno2 the minimum ppm is 2 (keep both rows). So I started with a groupby:
> grouped_series = df.groupby(['a']).ppm.min()
> grouped_series
a
anno1 1
anno2 2
Now I have the minimum ppm for each value in a. But how do I use this series to filter the original dataframe? Or is there an easier way to do this? I tried several variations of:
new_df = df.loc[ df.loc[:,'ppm']==grouped_series.loc[df.loc[:,'a']] , :]
but this gives me a ValueError: Can only compare identically-labeled Series objects
Use GroupBy.transform to compute the per-group minimum as a Series with the same size and index as df, so the comparison works row by row. When filtering with boolean indexing, loc is not necessary:
new_df = df[df['ppm'] == df.groupby('a').ppm.transform('min')]
print (new_df)
a ppm
0 anno1 1
2 anno2 2
3 anno2 2
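To make it clearer why the transform version avoids the ValueError, here is a small self-contained sketch that spells out the intermediate steps. The names group_min, mask, mapped_min and new_df_via_map are only illustrative. The Series.map variant at the end is not from the answer above; it is one possible way to reuse the aggregated grouped_series from the question.

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 'anno1', 'ppm': 1},
    {'a': 'anno1', 'ppm': 2},
    {'a': 'anno2', 'ppm': 2},
    {'a': 'anno2', 'ppm': 2},
])

# transform('min') broadcasts each group's minimum back onto every row,
# so the result is indexed 0..3 like df, not by the values of 'a'.
group_min = df.groupby('a')['ppm'].transform('min')

# Both sides carry identical labels, so == compares row by row.
mask = df['ppm'] == group_min
new_df = df[mask]

# Variant (an assumption, not part of the answer above): reuse the
# aggregated series from the question by mapping it onto column 'a'.
grouped_series = df.groupby('a')['ppm'].min()
mapped_min = df['a'].map(grouped_series)  # indexed like df again
new_df_via_map = df[df['ppm'] == mapped_min]
```

The attempt in the question fails because grouped_series.loc[df['a']] is labelled by anno1/anno2 rather than by 0..3, so pandas refuses to compare it with df['ppm'].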
Here is an alternative approach if you don't mind resetting the original index:
df.merge(df.groupby(['a'])['ppm'].min().reset_index(), how='inner')
Output:
a ppm
0 anno1 1
1 anno2 2
2 anno2 2
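If the original index does matter, a possible variation on the merge approach is to carry the index through as a column and restore it afterwards. This is only a sketch under that assumption. The names mins, merged and kept are illustrative, and the explicit on=['a', 'ppm'] simply names the columns the merge above would pick by default.

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 'anno1', 'ppm': 1},
    {'a': 'anno1', 'ppm': 2},
    {'a': 'anno2', 'ppm': 2},
    {'a': 'anno2', 'ppm': 2},
])

# One row per group holding that group's minimum ppm.
mins = df.groupby('a', as_index=False)['ppm'].min()

# Plain inner merge: keeps the right rows but renumbers them from 0.
merged = df.merge(mins, on=['a', 'ppm'], how='inner')

# Move the original index into a column before merging, then restore it.
kept = (
    df.reset_index()
      .merge(mins, on=['a', 'ppm'], how='inner')
      .set_index('index')
      .rename_axis(None)  # drop the leftover index name 'index'
)
```

Here kept should hold the same rows as merged, but labelled with the row numbers from the original df.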