Using custom functions in a pandas groupby aggregation
I have a DataFrame like this:
>>> import pandas as pd
>>> data = {
...     'year': [2019, 2020, 2020, 2019, 2020, 2019],
...     'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
...     'price': [100, 122, 0, 150, 120, 80],
...     'count': [20, 15, 24, 16, 24, 10],
... }
>>> df = pd.DataFrame(data)
>>> df
   year provider  price  count
0  2019        X    100     20
1  2020        X    122     15
2  2020        Y      0     24
3  2019        Z    150     16
4  2020        Z    120     24
5  2019        T     80     10
And this is the expected output:
  provider  price_rate  count_rate
0        X        0.22       -0.25
1        Z       -0.20        0.50
I want to group by provider and find the relative change in price and count between 2019 and 2020. If a provider has no price or count record for 2019 or for 2020, I don't want to see that provider in the result.
Assuming there are always only 1 or 2 rows per provider, we can first sort_values on year to make sure 2019 comes before 2020.
Then we groupby on provider, divide each row of price and count by the previous row of the same provider, and subtract 1.
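df = df.sort_values('year')
grp = (
    # group_keys=False keeps the original row index (needed on pandas 2.0+)
    # so that the join below lines up with df
    df.groupby('provider', group_keys=False)
      .apply(lambda x: x[['price', 'count']].div(x[['price', 'count']].shift()).sub(1))
)
# providers with a single row only produce NaN and are dropped here
dfnew = df[['provider']].join(grp).dropna()
  provider  price  count
1        X   0.22  -0.25
4        Z  -0.20   0.50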
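Since the question asks about custom functions, the same idea can also be written with a named helper that is called once per group. The following is only a minimal sketch, not part of the original answer: `yearly_rate` is a hypothetical helper name, and the code assumes at most one row per provider and year.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2020, 2019, 2020, 2019],
    'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price': [100, 122, 0, 150, 120, 80],
    'count': [20, 15, 24, 16, 24, 10],
})


def yearly_rate(group):
    """Hypothetical helper: relative change from 2019 to 2020 for one provider.

    Returns None when either year is missing.
    Assumes at most one row per year within the group.
    """
    by_year = group.set_index('year')[['price', 'count']]
    if 2019 not in by_year.index or 2020 not in by_year.index:
        return None
    return by_year.loc[2020] / by_year.loc[2019] - 1


# Explicit loop over the groups, keeping only providers that have both years.
rates = {}
for provider, group in df.groupby('provider'):
    rate = yearly_rate(group)
    if rate is not None:
        rates[provider] = rate

# One row per provider, with columns renamed to match the expected output.
result = (
    pd.DataFrame(rates).T
    .add_suffix('_rate')
    .rename_axis('provider')
    .reset_index()
)
print(result.round(2))
```

Looking up the years explicitly means the result does not depend on row order, at the cost of a Python-level loop over the groups.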
Or, using only vectorized methods:
# keep only providers that appear more than once, ordered by provider and year
dfnew = df[df['provider'].duplicated(keep=False)].sort_values(['provider', 'year'])
dfnew[['price', 'count']] = (
    dfnew[['price', 'count']].div(dfnew[['price', 'count']].shift()).sub(1)
)
# keep only rows whose previous row belongs to the same provider
dfnew = dfnew[dfnew['provider'].eq(dfnew['provider'].shift())].drop('year', axis=1)
  provider  price  count
1        X   0.22  -0.25
4        Z  -0.20   0.50
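A possible variant of the vectorized approach is to shift within each provider, so that a row is never divided by a row from a different provider and no separate mask is needed. This is a sketch under the same assumption of at most one row per provider and year; the names `ordered`, `previous`, and `rates` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2020, 2019, 2020, 2019],
    'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price': [100, 122, 0, 150, 120, 80],
    'count': [20, 15, 24, 16, 24, 10],
})

cols = ['price', 'count']

# Order rows so that, within each provider, 2019 comes before 2020.
ordered = df.sort_values(['provider', 'year'])

# Previous row of the same provider; NaN for the first row of each provider.
previous = ordered.groupby('provider')[cols].shift()

# Relative change versus the previous row of the same provider.
rates = ordered[cols].div(previous).sub(1)

# Rows without a previous row (single-year providers) are NaN and dropped.
dfnew = ordered[['provider']].join(rates).dropna()
print(dfnew.round(2))
```

Because `groupby(...).shift()` returns a frame with the same index as its input, the result can be joined back onto `ordered` directly.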
You can try:
final = (df.set_index(['provider', 'year']).groupby(level=0)
           .pct_change().dropna().droplevel(1).add_suffix('_rate').reset_index())
  provider  price_rate  count_rate
0        X        0.22       -0.25
1        Z       -0.20        0.50
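The answers above rely on row order and on each provider having at most two rows. If it helps to make the "both years must be present" rule explicit, one alternative is to pivot the years into columns first. This is a different technique from the answers above, shown only as a sketch; the names `wide` and `rates` are illustrative. Note that `DataFrame.pivot` raises a `ValueError` if a provider has more than one row for the same year.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2020, 2019, 2020, 2019],
    'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price': [100, 122, 0, 150, 120, 80],
    'count': [20, 15, 24, 16, 24, 10],
})

# One row per provider; columns become (value, year) pairs.
wide = df.pivot(index='provider', columns='year', values=['price', 'count'])

# A provider missing either year has NaN in that column and is dropped.
wide = wide.dropna()

# Relative change from 2019 to 2020 for each measure.
rates = (
    pd.DataFrame({
        'price_rate': wide[('price', 2020)] / wide[('price', 2019)] - 1,
        'count_rate': wide[('count', 2020)] / wide[('count', 2019)] - 1,
    })
    .rename_axis('provider')
    .reset_index()
)
print(rates.round(2))
```

With this layout, the years being compared are named in the code rather than implied by sort order. A 2019 value of 0 would still produce an infinite rate, so such rows may need separate handling.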