
Pandas - How to drop rows based on a unique column value where another column value is a minimum and handling nulls?

I have a pandas dataframe with something like the following:

index  order_id  cost
123a   123       5
123b   123       None
123c   123       3
124a   124       None
124b   124       None

For each unique value of order_id, I'd like to drop any row that isn't the lowest cost. For any order_id that only contains nulls for the cost, any row for an order_id can be retained.

I've been struggling with this for a while now.

ol3 = ol3.loc[ol3.groupby('order_id').cost.idxmin()]

This code doesn't play nicely with the order_ids that have only nulls. So I tried to work out how to drop the nulls I don't want with

ol4 = ol3.loc[ol3['cost'].isna()].drop_duplicates(subset=['order_id', 'cost'], keep='first')

This gives me the list of null order_ids I want to retain. I'm not sure where to go from here, and I'm pretty sure I'm looking at this the wrong way. Any help would be appreciated!

You can use transform to get the minimum cost per order_id, broadcast back to every row, and keep the rows whose cost equals it. We additionally need an isna check for the special order_ids that have only NaNs:

order_mins = df.groupby('order_id').cost.transform('min')
df[(df.cost == order_mins) | (order_mins.isna())]

cond_1 = df.cost.eq(df.cost.groupby(df.order_id).transform("min"))
cond_2 = df.cost.isna().groupby(df.order_id).transform("all")
new    = df[cond_1 | cond_2]

  • condition 1: check whether a cost equals its group's minimum
  • condition 2: check whether every cost in the group is missing
  • if either of these is true, keep the corresponding row

In [246]: cond_1
Out[246]:
0    False
1    False
2     True            <--- cost equals the minimum of its group
3    False
4    False
Name: cost, dtype: bool

In [247]: cond_2
Out[247]:
0    False
1    False
2    False
3     True           <--- the ID of these has all NaNs  
4     True           <--- in the cost part (id 124)
Name: cost, dtype: bool

In [248]: new
Out[248]:
  index  order_id  cost
2  123c       123   3.0
3  124a       124   NaN
4  124b       124   NaN

I ran df.cost = pd.to_numeric(df.cost, errors="coerce") before the code above, so that None becomes NaN and the column is numeric.
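
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample frame from the question, coerces cost to numeric, and applies the two conditions. The names df, cond_1, cond_2 and new follow the answer above; building the frame from a dict is just an assumption about how the data might be loaded.

```python
import pandas as pd

# Sample data copied from the question; None marks a missing cost.
df = pd.DataFrame({
    "index": ["123a", "123b", "123c", "124a", "124b"],
    "order_id": [123, 123, 123, 124, 124],
    "cost": [5, None, 3, None, None],
})

# Ensure cost is numeric; values that cannot be parsed become NaN.
df["cost"] = pd.to_numeric(df["cost"], errors="coerce")

# True where a row's cost equals the minimum cost of its order_id group.
# Comparisons against NaN are False, so all-NaN groups fail this test.
cond_1 = df["cost"].eq(df.groupby("order_id")["cost"].transform("min"))

# True for every row of a group in which all costs are missing.
cond_2 = df["cost"].isna().groupby(df["order_id"]).transform("all")

new = df[cond_1 | cond_2]
print(new)
```

Note that this keeps every row of an all-NaN group, and every row tied for the minimum, so an order_id can end up with more than one row.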

You can (temporarily) fill the NA/None values with np.inf before calling idxmin (np is numpy, imported with import numpy as np):

ol3.loc[ol3['cost'].fillna(np.inf).groupby(ol3['order_id']).idxmin()]

This gives you exactly one row per order_id.

output:

  index  order_id  cost
2  123c       123   3.0
3  124a       124   NaN
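
A self-contained version of the same idea, as a sketch: the sample frame is rebuilt from the question, and filled, keep and result are hypothetical names introduced only to split the one-liner into steps. It assumes cost is already numeric (or has been coerced with pd.to_numeric).

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; None marks a missing cost.
ol3 = pd.DataFrame({
    "index": ["123a", "123b", "123c", "124a", "124b"],
    "order_id": [123, 123, 123, 124, 124],
    "cost": [5, None, 3, None, None],
})
ol3["cost"] = pd.to_numeric(ol3["cost"], errors="coerce")

# Replace NaN with +inf in a temporary Series only; ol3 itself is unchanged.
filled = ol3["cost"].fillna(np.inf)

# For each order_id, take the row label of the smallest filled cost.
# In an all-NaN group every value is inf, so the first row is picked.
keep = filled.groupby(ol3["order_id"]).idxmin()

# Select those rows from the original frame, so missing costs still show as NaN.
result = ol3.loc[keep]
print(result)
```

Unlike the boolean-mask approaches, ties and all-NaN groups are resolved by keeping only the first matching row, which is what guarantees one row per order_id.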
