pandas - groupby multiple values?

I have a DataFrame that contains cell phone usage in minutes, logged by call date and duration.

It looks like this (31-row sample):

          id  user_id  call_date  duration
0    1000_93     1000 2018-12-27      8.52
1   1000_145     1000 2018-12-27     13.66
2   1000_247     1000 2018-12-27     14.48
3   1000_309     1000 2018-12-28      5.76
4   1000_380     1000 2018-12-30      4.22
5   1000_388     1000 2018-12-31      2.20
6   1000_510     1000 2018-12-27      5.75
7   1000_521     1000 2018-12-28     14.18
8   1000_530     1000 2018-12-28      5.77
9   1000_544     1000 2018-12-26      4.40
10  1000_693     1000 2018-12-31      4.31
11  1000_705     1000 2018-12-31     12.78
12  1000_735     1000 2018-12-29      1.70
13  1000_778     1000 2018-12-28      3.29
14  1000_826     1000 2018-12-26      9.96
15  1000_842     1000 2018-12-27      5.85
16    1001_0     1001 2018-09-06     10.06
17    1001_1     1001 2018-10-12      1.00
18    1001_2     1001 2018-10-17     15.83
19    1001_4     1001 2018-12-05      0.00
20    1001_5     1001 2018-12-13      6.27
21    1001_6     1001 2018-12-04      7.19
22    1001_8     1001 2018-11-17      2.45
23    1001_9     1001 2018-11-19      2.40
24   1001_11     1001 2018-11-09      1.00
25   1001_13     1001 2018-12-24      0.00
26   1001_19     1001 2018-11-15     30.00
27   1001_20     1001 2018-09-21      5.75
28   1001_23     1001 2018-10-27      0.98
29   1001_26     1001 2018-10-28      5.90
30   1001_29     1001 2018-09-30     14.78

I want to group by user_id AND call_date, with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.

I thought I could accomplish this by using:

calls.groupby(['user_id','call_date'])['duration'].sum()

but the results aren't what I expected:

  user_id  call_date 
1000     2018-12-26    14.36
         2018-12-27    48.26
         2018-12-28    29.00
         2018-12-29     1.70
         2018-12-30     4.22
         2018-12-31    19.29
1001     2018-08-14    13.86
         2018-08-16    23.46
         2018-08-17     8.11
         2018-08-18     1.74
         2018-08-19    10.73
         2018-08-20     7.32
         2018-08-21     0.00
         2018-08-23     8.50
         2018-08-24     8.63
         2018-08-25    35.39
         2018-08-27    10.57
         2018-08-28    19.91
         2018-08-29     0.54
         2018-08-31    22.38
         2018-09-01     7.53
         2018-09-02    10.27
         2018-09-03    30.66
         2018-09-04     0.00
         2018-09-05     9.09
         2018-09-06    10.06

I'd hoped the result would be grouped by month: for user_id 1000, all calls in January with duration summed, all calls in February with duration summed, and so on.

I am really new to Python and to programming in general, and I am not sure what my next step should be to get these grouped by user_id and month of the year.

Thanks in advance for any insight you can offer.

Regards,

Jared

Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note that we explicitly convert call_date to datetime:

from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
          id  user_id  call_date  duration
0    1000_93     1000 2018-12-27      8.52
1   1000_145     1000 2018-12-27     13.66
2   1000_247     1000 2018-12-27     14.48
3   1000_309     1000 2018-12-28      5.76
4   1000_380     1000 2018-12-30      4.22
5   1000_388     1000 2018-12-31      2.20
6   1000_510     1000 2018-12-27      5.75
7   1000_521     1000 2018-12-28     14.18
8   1000_530     1000 2018-12-28      5.77
9   1000_544     1000 2018-12-26      4.40
10  1000_693     1000 2018-12-31      4.31
11  1000_705     1000 2018-12-31     12.78
12  1000_735     1000 2018-12-29      1.70
13  1000_778     1000 2018-12-28      3.29
14  1000_826     1000 2018-12-26      9.96
15  1000_842     1000 2018-12-27      5.85
16    1001_0     1001 2018-09-06     10.06
17    1001_1     1001 2018-10-12      1.00
18    1001_2     1001 2018-10-17     15.83
19    1001_4     1001 2018-12-05      0.00
20    1001_5     1001 2018-12-13      6.27
21    1001_6     1001 2018-12-04      7.19
22    1001_8     1001 2018-11-17      2.45
23    1001_9     1001 2018-11-19      2.40
24   1001_11     1001 2018-11-09      1.00
25   1001_13     1001 2018-12-24      0.00
26   1001_19     1001 2018-11-15     30.00
27   1001_20     1001 2018-09-21      5.75
28   1001_23     1001 2018-10-27      0.98
29   1001_26     1001 2018-10-28      5.90
30   1001_29     1001 2018-09-30     14.78
"""), sep=r"\s+", index_col=0)  # sep=r"\s+" replaces the deprecated delim_whitespace=True
df['call_date'] = pd.to_datetime(df['call_date'])
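
To confirm that the conversion took effect, it can help to inspect the column's dtype before grouping. The following is a minimal sketch on a hypothetical three-row frame (`sample` is an illustrative name, not part of the original data loading):

```python
import pandas as pd

# Hypothetical three-row frame; the dates start out as plain strings.
sample = pd.DataFrame({
    "user_id": [1000, 1000, 1001],
    "call_date": ["2018-12-27", "2018-12-28", "2018-09-06"],
    "duration": [8.52, 5.76, 10.06],
})
is_datetime_before = pd.api.types.is_datetime64_any_dtype(sample["call_date"])

# Parse the strings into real datetime values.
sample["call_date"] = pd.to_datetime(sample["call_date"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(sample["call_date"])

print(is_datetime_before, is_datetime_after)  # False True
```

Time-based grouping such as pd.Grouper with a frequency needs a datetime column, so this check is worth doing if the dates were read from text.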

Then using

df.groupby(['user_id','call_date'])['duration'].sum()

does the expected grouping by user and by each date:

user_id  call_date 
1000     2018-12-26    14.36
         2018-12-27    48.26
         2018-12-28    29.00
         2018-12-29     1.70
         2018-12-30     4.22
         2018-12-31    19.29
1001     2018-09-06    10.06
         2018-09-21     5.75
         2018-09-30    14.78
         2018-10-12     1.00
         2018-10-17    15.83
         2018-10-27     0.98
         2018-10-28     5.90
         2018-11-09     1.00
         2018-11-15    30.00
         2018-11-17     2.45
         2018-11-19     2.40
         2018-12-04     7.19
         2018-12-05     0.00
         2018-12-13     6.27
         2018-12-24     0.00
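
As a sanity check on one of these rows, the five calls that user 1000 made on 2018-12-27 in the sample can be summed on their own. This is only a sketch; the frame `day` is a hypothetical subset copied by hand from the data above:

```python
import pandas as pd

# The five calls user 1000 made on 2018-12-27, copied from the sample data.
day = pd.DataFrame({
    "user_id": [1000] * 5,
    "call_date": pd.to_datetime(["2018-12-27"] * 5),
    "duration": [8.52, 13.66, 14.48, 5.75, 5.85],
})

# Grouping by both keys yields one row per (user_id, call_date) pair.
daily = day.groupby(["user_id", "call_date"])["duration"].sum()
total = daily.loc[(1000, pd.Timestamp("2018-12-27"))]
print(round(total, 2))  # 48.26
```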

If you want to group by month, as you seem to suggest, you can use the Grouper functionality:

df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()

which produces

user_id  call_date 
1000     2018-12-31    116.83
1001     2018-09-30     30.59
         2018-10-31     23.71
         2018-11-30     35.85
         2018-12-31     13.46
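
Note that pandas 2.2 and later prefer the alias 'ME' over 'M' for month-end frequency. An alternative that sidesteps frequency aliases for offsets is to group on a monthly period derived from the date column. The following is a sketch rather than the only way to do it; `calls` is a hypothetical five-row subset of the data above, and `month`, `monthly` and `wide` are illustrative names:

```python
import pandas as pd

# Hypothetical five-row subset of the sample data.
calls = pd.DataFrame({
    "user_id": [1000, 1000, 1001, 1001, 1001],
    "call_date": pd.to_datetime(
        ["2018-12-27", "2018-12-28", "2018-09-06", "2018-09-21", "2018-10-12"]
    ),
    "duration": [8.52, 5.76, 10.06, 5.75, 1.00],
})

# Group by user and by calendar month (a Period such as 2018-12).
month = calls["call_date"].dt.to_period("M").rename("month")
monthly = calls.groupby(["user_id", month])["duration"].sum()
print(monthly)

# One row per user, one column per month; months without calls become 0.
wide = monthly.unstack(fill_value=0)
print(wide)
```

The period labels read as 2018-12 rather than the month-end date 2018-12-31, and the wide layout gives one row per user with a column for each month, which may be closer to the per-user, per-month view the question asks for.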

Let me know if you get different results from following these steps.
