Pandas groupby multiple fields then diff

My DataFrame looks like this:

         date    site country  score
0  2018-01-01  google      us    100
1  2018-01-01  google      ch     50
2  2018-01-02  google      us     70
3  2018-01-03  google      us     60
4  2018-01-02  google      ch     10
5  2018-01-01      fb      us     50
6  2018-01-02      fb      us     55
7  2018-01-03      fb      us    100
8  2018-01-01      fb      es    100
9  2018-01-02      fb      gb    100

Each site has a different score depending on the country. I'm trying to find the 1/3/5-day difference of scores for each site/country combination.

The output should be:

          date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

I first tried sorting by site/country/date, then grouping by site and country, but I can't work out how to compute a difference from a grouped object.

First sort the DataFrame; then all you need is groupby.diff():

df = df.sort_values(by=['site', 'country', 'date'])

df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

df
Out: 
         date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0
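
The question also asks about 3- and 5-day differences. The following is a minimal sketch, assuming each site/country group has exactly one row per day with no gaps, because the periods argument of diff counts rows, not calendar days. The column names diff_1, diff_3 and diff_5 are illustrative and not from the original post.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-02",
        "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01", "2018-01-02",
    ]),
    "site": ["google"] * 5 + ["fb"] * 5,
    "country": ["us", "ch", "us", "us", "ch", "us", "us", "us", "es", "gb"],
    "score": [100, 50, 70, 60, 10, 50, 55, 100, 100, 100],
})

df = df.sort_values(by=["site", "country", "date"])

# One column per lag; rows without an earlier row n steps back get 0.
for n in (1, 3, 5):
    df[f"diff_{n}"] = (
        df.groupby(["site", "country"])["score"].diff(periods=n).fillna(0)
    )

print(df)
```

With this sample data no group has more than three rows, so diff_3 and diff_5 end up as all zeros; they only become informative with longer histories. If dates can be missing, a row-based lag no longer corresponds to a day-based lag and the data would need to be reindexed to a daily frequency first.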

sort_values orders by value (alphabetically for strings), so fb sorts before google. If you need a custom order (google before fb, for example), define that order in a list and convert the column to an ordered categorical. sort_values will then respect the order you provided.
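
As a rough illustration of the categorical approach, here is a small sketch. The list site_order and the reduced sample frame are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["fb", "google", "fb", "google"],
    "score": [50, 100, 55, 70],
})

# Hypothetical custom order: google should come before fb.
site_order = ["google", "fb"]
df["site"] = pd.Categorical(df["site"], categories=site_order, ordered=True)

# Sorting now follows the category order instead of alphabetical order.
# kind="stable" keeps the original row order within each site.
df = df.sort_values("site", kind="stable")
print(df)
```

If you later group by a categorical column, passing observed=True to groupby restricts the groups to combinations that actually occur in the data.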

You can shift and subtract grouped values:

df.sort_values(['site', 'country', 'date'], inplace=True)

df['diff'] = df['score'] - df.groupby(['site', 'country'])['score'].shift()

Result:

         date    site country  score  diff
8  2018-01-01      fb      es    100   NaN
9  2018-01-02      fb      gb    100   NaN
5  2018-01-01      fb      us     50   NaN
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   NaN
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   NaN
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

To fill NaN with 0, assign the result back: df['diff'] = df['diff'].fillna(0)
