
Using custom functions on pandas group by aggregating

I have a dataframe like this:

>>> import pandas as pd
>>> data = {
...     'year': [2019, 2020, 2020, 2019, 2020, 2019],
...     'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
...     'price': [100, 122, 0, 150, 120, 80],
...     'count': [20, 15, 24, 16, 24, 10]
... }
>>> df = pd.DataFrame(data)
>>> df
   year provider  price  count
0  2019        X    100     20
1  2020        X    122     15
2  2020        Y      0     24
3  2019        Z    150     16
4  2020        Z    120     24
5  2019        T     80     10

And this is the expected output:

  provider  price_rate  count_rate
0        X        0.22       -0.25
1        Z       -0.20        0.50

I want to group by provider and find the relative change in price and count between 2019 and 2020. If a provider has no price or count record for either 2019 or 2020, I don't want to see that provider in the output.

Assuming there are always only 1 or 2 rows per provider, we can first sort_values on year to make sure 2019 comes before 2020.

Then we groupby on provider, divide each row's price and count by the previous row's values within the group, and subtract 1.

df = df.sort_values('year')
grp = (
    # group_keys=False keeps the original row index, so the join below lines up
    df.groupby('provider', group_keys=False)
      .apply(lambda x: x[['price', 'count']].div(x[['price', 'count']].shift()).sub(1))
)

dfnew = df[['provider']].join(grp).dropna()

  provider  price  count
1        X   0.22  -0.25
4        Z  -0.20   0.50
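
The approach above depends on each provider having at most two rows, one per year. A minimal sketch of how that assumption could be checked before computing the rates is shown here; the name rows_per_provider is only illustrative and does not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2020, 2019, 2020, 2019],
    'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price': [100, 122, 0, 150, 120, 80],
    'count': [20, 15, 24, 16, 24, 10],
})

# Number of rows per provider; the shift-based approach assumes this never exceeds 2.
rows_per_provider = df.groupby('provider').size()
assert rows_per_provider.le(2).all(), "a provider has more than two rows"

# Each (provider, year) pair should appear at most once.
assert not df.duplicated(['provider', 'year']).any(), "duplicate provider/year pair"
```

If either assertion fails, dividing by the shifted row would no longer compare 2020 against 2019 for every provider.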

Or using only vectorized methods:

dfnew = df[df['provider'].duplicated(keep=False)].sort_values(['provider', 'year'])
dfnew[['price', 'count']] = (
    dfnew[['price', 'count']].div(dfnew[['price', 'count']].shift()).sub(1)
)

dfnew = dfnew[dfnew['provider'].eq(dfnew['provider'].shift())].drop('year', axis=1)

  provider  price  count
1        X   0.22  -0.25
4        Z  -0.20   0.50

You can try:

final = (df.set_index(['provider','year']).groupby(level=0)
      .pct_change().dropna().droplevel(1).add_suffix('_rate').reset_index())

  provider  price_rate  count_rate
0        X        0.22       -0.25
1        Z       -0.20        0.50
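
As a possible alternative that does not rely on row order, the years can be pivoted into columns so that the 2020 and 2019 values are compared explicitly. This is only a sketch of a different technique (pivot instead of groupby), not any answerer's method; the names wide, rate and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2020, 2019, 2020, 2019],
    'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price': [100, 122, 0, 150, 120, 80],
    'count': [20, 15, 24, 16, 24, 10],
})

# One row per provider, one column per (value, year) pair; missing years become NaN.
wide = df.pivot(index='provider', columns='year', values=['price', 'count'])

# Relative change from 2019 to 2020 for price and count at once.
rate = wide.xs(2020, axis=1, level='year') / wide.xs(2019, axis=1, level='year') - 1

# Providers missing either year have NaN rates and are dropped here.
result = rate.dropna().add_suffix('_rate').round(2).reset_index()
print(result)
```

Note that a provider with a 2019 value of 0 would produce an infinite rate rather than NaN, so dropna would not remove it; that case does not occur in the sample data.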
