I need to group a Pandas dataframe by date, and then take a weighted average of given values. Here's how it's currently done using the margin value as an example (and it works perfectly until there are NaN values):
import numpy as np
import pandas as pd

df = orders.copy()
# Create new columns as required
df['margin_WA'] = df['net_margin'].astype(float)  # original data as str or Decimal

def group_wa():
    return lambda num: np.average(num, weights=df.loc[num.index, 'order_amount'])

agg_func = {
    'margin_WA': group_wa(),  # agg_func includes WAs for other elements
}
result = df.groupby('order_date').agg(agg_func)
result['margin_WA'] = result['margin_WA'].astype(str)
In the case where 'net_margin' fields contain NaN values, the WA is set to NaN. I can't seem to dropna() or filter by pd.notnull when creating the new columns, and I don't know where to create a masked array to avoid passing NaN to the group_wa function (as suggested here). How do I ignore NaN in this case?
I think a simple solution is to drop the missing values before you groupby/aggregate like:
result = df.dropna(subset=['margin_WA']).groupby('order_date').agg(agg_func)
In this case, no indices containing missing values are passed to your group_wa function.
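A minimal sketch of this first approach, using made-up sample data (the dates, amounts, and margins below are assumptions for illustration): rows with a missing margin are dropped before the groupby, so each group's weighted average is computed only over complete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one NaN margin on 2023-01-01
df = pd.DataFrame({
    'order_date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'order_amount': [100.0, 300.0, 50.0, 150.0],
    'margin_WA': [0.10, np.nan, 0.20, 0.40],
})

def group_wa(num):
    # Weight each group's margins by the matching order amounts
    return np.average(num, weights=df.loc[num.index, 'order_amount'])

result = (df.dropna(subset=['margin_WA'])
            .groupby('order_date')
            .agg({'margin_WA': group_wa}))
# 2023-01-01 keeps only the 0.10 row; 2023-01-02 is (0.20*50 + 0.40*150) / 200
print(result)
```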
Another approach is to move the dropna into your aggregating function:
def group_wa(series):
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])

agg_func = {'margin_WA': group_wa}
result = df.groupby('order_date').agg(agg_func)
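A quick end-to-end check of this second approach with hypothetical data (column values are made up): the group containing a NaN still yields a weighted average over its remaining rows, because dropna runs inside the aggregator and the weights are looked up only for the surviving index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one NaN margin on 2023-01-01
df = pd.DataFrame({
    'order_date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'order_amount': [100.0, 300.0, 50.0, 150.0],
    'margin_WA': [0.10, np.nan, 0.20, 0.40],
})

def group_wa(series):
    # Drop NaNs within the group, then weight by the matching order amounts
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])

result = df.groupby('order_date').agg({'margin_WA': group_wa})
# 2023-01-01 reduces to the single 0.10 row after dropna;
# 2023-01-02 is (0.20*50 + 0.40*150) / 200
print(result)
```

Note that a group whose values are all NaN would pass an empty series to np.average and raise a ZeroDivisionError, so the first approach is safer if such groups can occur.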