
How can I apply groupby twice on a pandas DataFrame?

I have a pandas DataFrame with the columns 'year', 'month' and 'transaction id'. I want to get the transaction count for every month of every year. For example, my data looks like this:

year: {2015,2015,2015,2016,2016,2017}
month: {1,  1,   2,   2,   2,    1}
tid: {123,  343, 453, 675, 786, 332}

I want output that gives, for each year, the number of transactions per month. For example, for the year 2015 I would get:

month: [1,2]
count: [2,1]

I used groupby('year'), but how can I get the per-month transaction count after that?

You need to group by both columns, year and month, and then aggregate with size:

import pandas as pd

year = [2015,2015,2015,2016,2016,2017]
month =  [1,  1,   2,   2,   2,    1]
tid = [123,  343, 453, 675, 786, 332]

df = pd.DataFrame({'year':year, 'month':month,'tid':tid})
print(df)
   year  month  tid
0  2015      1  123
1  2015      1  343
2  2015      2  453
3  2016      2  675
4  2016      2  786
5  2017      1  332

df1 = df.groupby(['year','month'])['tid'].size().reset_index(name='count')
print (df1)
   year  month  count
0  2015      1      2
1  2015      2      1
2  2016      2      2
3  2017      1      1
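
To get the layout asked for in the question (months and counts for a single year), one option is to filter the aggregated frame by year. This is a minimal sketch that rebuilds the sample data so it runs on its own; the name `counts_2015` is only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'year':  [2015, 2015, 2015, 2016, 2016, 2017],
    'month': [1, 1, 2, 2, 2, 1],
    'tid':   [123, 343, 453, 675, 786, 332],
})

# Count transactions per (year, month) pair, as in the answer above
df1 = df.groupby(['year', 'month'])['tid'].size().reset_index(name='count')

# Keep only the rows for one year (2015 is chosen for illustration)
counts_2015 = df1.loc[df1['year'] == 2015, ['month', 'count']]

print(counts_2015['month'].tolist())  # [1, 2]
print(counts_2015['count'].tolist())  # [2, 1]
```

Because a single groupby over both columns already holds every year, no second groupby call is needed; each year is just a slice of the result.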

Another option for more complex tasks: suppose you want to group by "year" and by a function applied to "tid", e.g. a bucket categorization:

def tidBucket(x):
    if x < 300:
        return "low"
    if 300 <= x < 700:   # chained comparison; `&` is bitwise and binds tighter than `<=`
        return "medium"
    if x >= 700:
        return "high"

Grouping by existing columns alone would not work here, because the bucket is computed rather than stored in a column. You could solve the problem by first grouping by year, then iterating over the groups and applying a second groupby to each:

gb = df.groupby(by='year')
for year, df1 in gb:
    # Index by tid so that groupby applies tidBucket to each index label
    df1 = df1.set_index("tid", drop=False)
    buckets = df1.groupby(by=tidBucket)

Then aggregate as desired. Alternatively, you could create an additional "bucket" column:

df["bucket"] = df["tid"].map(tidBucket)

and then follow @jezrael's solution above, grouping by "year" and "bucket".
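
Put together, the bucket-column route might look like the sketch below. It reuses the sample data and `tidBucket` from above, with the range check written as a chained comparison. Grouping by `['year', 'bucket']` and counting with `size` is an assumption about the desired aggregation, and `df2` is an illustrative name.

```python
import pandas as pd

def tidBucket(x):
    # Classify a transaction id into a coarse bucket
    if x < 300:
        return "low"
    if 300 <= x < 700:
        return "medium"
    if x >= 700:
        return "high"

df = pd.DataFrame({
    'year':  [2015, 2015, 2015, 2016, 2016, 2017],
    'month': [1, 1, 2, 2, 2, 1],
    'tid':   [123, 343, 453, 675, 786, 332],
})

# Materialize the bucket as a regular column ...
df["bucket"] = df["tid"].map(tidBucket)

# ... so that a single groupby over two columns is enough
df2 = df.groupby(['year', 'bucket'])['tid'].size().reset_index(name='count')
print(df2)
```

Storing the bucket as a column avoids the explicit loop over years and leaves all results in one DataFrame.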
