Merge Pandas Dataframe Rows based on multiple conditions
Hi, I have a pandas DataFrame that contains dates and amounts.
Date Amount
0 10/02/22 1600
1 10/02/22 150
2 11/02/22 100
3 11/02/22 800
4 11/02/22 125
If an entry is one day later than another entry and less than 10% of its amount, I would like to sum the two amounts and keep the earliest date.
So the df would look like:
Date Amount
0 10/02/22 1825
1 10/02/22 150
2 11/02/22 800
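To make the rule concrete, here is a minimal sketch that rebuilds the sample frame and checks each second-day amount against the first day's largest amount. Comparing against the largest amount of the previous day is an assumption about what "any other entry" means; it is consistent with the expected output above. The names `ratios` and `small` are illustrative only.

```python
import pandas as pd

# Sample data from the question (dates are day/month/year)
df = pd.DataFrame({
    'Date': ['10/02/22', '10/02/22', '11/02/22', '11/02/22', '11/02/22'],
    'Amount': [1600, 150, 100, 800, 125],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')

# Largest amount on the first day
first_day_max = df.loc[df['Date'] == pd.Timestamp('2022-02-10'), 'Amount'].max()

# Ratio of each second-day amount to that maximum
second_day = df.loc[df['Date'] == pd.Timestamp('2022-02-11'), 'Amount']
ratios = second_day / first_day_max

# Index labels of the rows that fall under the 10% threshold
small = ratios[ratios < 0.1].index.tolist()
print(ratios)
print(small)
```

Under that reading, rows 2 and 4 (100 and 125) are folded into the 1600 row, giving 1825, while 800 stays on its own.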
I've tried creating a threshold and then building groups based on these conditions, but this does not yield the expected results.
threshold_selector = (amount_difference < 0.1) & (date_difference == day)
where day is a time delta of one day.
groups = threshold_selector.cumsum()
dates= dates.groupby(groups).agg({'Amount':sum, 'Date': min})
The result is all rows joined into one.
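One possible explanation for that symptom, offered only as a guess because the definitions of `amount_difference` and `date_difference` are not shown: if the selector is False on every row, its cumulative sum is 0 everywhere, so `groupby` sees a single group. The sketch below uses made-up boolean flags to show that behaviour, and the common idiom of negating a "merge with the previous row" flag before `cumsum`, which only handles adjacent rows.

```python
import pandas as pd

amounts = pd.Series([1600, 150, 100, 800, 125])

# Hypothetical selector: assume no row satisfied both conditions
selector = pd.Series([False, False, False, False, False])
groups = selector.cumsum()                  # every label is 0
collapsed = amounts.groupby(groups).sum()   # hence a single group

# Illustrative flags (not derived from the question's data):
# True means "merge this row into the row just before it"
merge_with_previous = pd.Series([False, True, False, False, True])
# Every unflagged row starts a new group
group_ids = (~merge_with_previous).cumsum()

print(collapsed)
print(group_ids.tolist())
```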
I would approach this using a pivot.
Sort the values by descending amount and pivot so that the largest value of each day lands in the first column. Then find the values that are at most 10% of the previous day's largest value, mask them, and add them to that first column. Finally, reshape back to the original layout:
import pandas as pd
# parse the day-first date strings
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(by=['Date', 'Amount'], ascending=[True, False])
# pivot to have col0 with the largest value per day
df2 = (df
.assign(col=df.groupby('Date').cumcount())
.pivot(index='Date', columns='col', values='Amount')
)
# flag values that are at most 10% of the previous day's max
mask = df2.div(df2[0].shift(1, freq='D'), axis=0).le(0.1).reindex_like(df2)
# add the flagged values to the previous day's max
df2[0] += df2.where(mask).sum(axis=1).shift(-1, freq='D').reindex(mask.index, fill_value=0)
# mask them
df2 = df2.mask(mask)
# reshape back dropping the NaNs
df2 = df2.stack().droplevel('col').reset_index(name='Amount')
Output:
Date Amount
0 2022-02-10 1825.0
1 2022-02-10 150.0
2 2022-02-11 800.0
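To see what the pivot step works on, here is a small sketch that reproduces only the intermediate wide table from the sample data. The names `sample` and `wide` are illustrative and not part of the answer; the date format is passed explicitly as an assumption about the input strings.

```python
import pandas as pd

# Sample data from the question
sample = pd.DataFrame({
    'Date': ['10/02/22', '10/02/22', '11/02/22', '11/02/22', '11/02/22'],
    'Amount': [1600, 150, 100, 800, 125],
})
sample['Date'] = pd.to_datetime(sample['Date'], format='%d/%m/%y')

# Rank amounts within each day (largest first), then spread the ranks into columns
wide = (sample
        .sort_values(by=['Date', 'Amount'], ascending=[True, False])
        .assign(col=lambda d: d.groupby('Date').cumcount())
        .pivot(index='Date', columns='col', values='Amount')
        )
print(wide)
```

Each row of `wide` is one day, column 0 holds that day's largest amount, and days with fewer entries are padded with NaN, which is why the final amounts come out as floats.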
Here is an alternative using a groupby approach:
import pandas as pd
# ensure datetime (dates are day-first)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# group Amounts by Date
g = df.groupby('Date')['Amount']
# get max amount per date
date_max = g.max()
# shift to previous date
prev_date_max = date_max.shift(1, freq='D').reindex(date_max.index, fill_value=0)
# identify rows to drop later
mask = df['Amount'].div(df['Date'].map(prev_date_max)).le(0.1)
# get value of next day to add to max
val_to_add = (df['Amount'][mask]
.groupby(df['Date']).sum()
.shift(-1, freq='D')
)
# add to max
df['Amount'] += df['Date'].map(val_to_add).where(df.index.isin(g.idxmax())).fillna(0)
# drop rows
df = df.loc[~mask]
Output:
Date Amount
0 2022-02-10 1825.0
1 2022-02-10 150.0
3 2022-02-11 800.0
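For reuse, the groupby steps above can be wrapped in a function that leaves the input frame untouched. This is a sketch under a few assumptions: the function name `merge_small_next_day` and the `ratio` parameter are hypothetical, the date strings are parsed with an explicit `%d/%m/%y` format, and only the sample data from the question has been checked.

```python
import pandas as pd


def merge_small_next_day(df, ratio=0.1):
    """Fold amounts that are <= ratio * previous day's max into that max row."""
    out = df.copy()
    out['Date'] = pd.to_datetime(out['Date'], format='%d/%m/%y')
    g = out.groupby('Date')['Amount']
    # max amount per date, re-indexed so each date sees the previous day's max
    date_max = g.max()
    prev_date_max = date_max.shift(1, freq='D').reindex(date_max.index, fill_value=0)
    # rows small enough to be merged into the previous day
    mask = out['Amount'].div(out['Date'].map(prev_date_max)).le(ratio)
    # total of the merged rows, moved back one day
    val_to_add = out.loc[mask, 'Amount'].groupby(out['Date']).sum().shift(-1, freq='D')
    # add that total only to the row holding each day's max
    extra = out['Date'].map(val_to_add).where(out.index.isin(g.idxmax())).fillna(0)
    out['Amount'] = out['Amount'] + extra
    return out.loc[~mask]


sample = pd.DataFrame({
    'Date': ['10/02/22', '10/02/22', '11/02/22', '11/02/22', '11/02/22'],
    'Amount': [1600, 150, 100, 800, 125],
})
result = merge_small_next_day(sample)
print(result)
```

Note that the original index labels are kept (0, 1, 3); call `reset_index(drop=True)` on the result if a fresh 0..n-1 index is wanted.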