
Pandas group-by date range & different calculations on multiple columns

I'm having trouble grouping a pandas df by time range while applying a different calculation to each column:

Let's take the following df:

           date          identifier    value_1    value_2
0     05.07.2018 16:35       A           10          0
1     05.07.2018 16:36       B           20          1
2     05.07.2018 16:37       A           20          2
3     05.07.2018 16:39       B           30          1
4     05.07.2018 16:40       A           40          3
5     05.07.2018 16:41       B           20          2
6     05.07.2018 16:41       A           30          1
7     05.07.2018 16:42       B           50          2
8     05.07.2018 16:43       B           20          3
9     05.07.2018 16:44       A           20          1

As a result I need a df that is grouped by time in 5-minute intervals and by identifier, with the average of value_1 and the sum of value_2:

           date          identifier    value_1    value_2
0     05.07.2018 16:35       A           15          2
1     05.07.2018 16:35       B           25          2
2     05.07.2018 16:40       A           30          5
3     05.07.2018 16:40       B           30          7

How can I do this most efficiently in pandas?

THX & BR from Vienna

You can use groupby, pd.Grouper, and agg, after converting your date column to datetime with the proper format:

import pandas as pd

# Set date to datetime format. I'm assuming it's day.month.year in your original dataframe
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y %H:%M')

# Group by identifier and 5-minute bins of date, then aggregate each column differently
new_df = (df.groupby(['identifier', pd.Grouper(key='date', freq='5min')])
          .agg({'value_1': 'mean', 'value_2': 'sum'}))

>>> new_df
                                value_1  value_2
identifier date                                 
A          2018-07-05 16:35:00       15        2
           2018-07-05 16:40:00       30        5
B          2018-07-05 16:35:00       25        2
           2018-07-05 16:40:00       30        7
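
To try this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same groupby. The DataFrame construction is my own transcription of the posted table, not something from the original post, so treat it as an assumption about how the data is stored (dates as strings, values as integers).

```python
import pandas as pd

# Sample data transcribed from the table in the question (assumed layout).
df = pd.DataFrame({
    "date": [
        "05.07.2018 16:35", "05.07.2018 16:36", "05.07.2018 16:37",
        "05.07.2018 16:39", "05.07.2018 16:40", "05.07.2018 16:41",
        "05.07.2018 16:41", "05.07.2018 16:42", "05.07.2018 16:43",
        "05.07.2018 16:44",
    ],
    "identifier": ["A", "B", "A", "B", "A", "B", "A", "B", "B", "A"],
    "value_1": [10, 20, 20, 30, 40, 20, 30, 50, 20, 20],
    "value_2": [0, 1, 2, 1, 3, 2, 1, 2, 3, 1],
})

# Parse the day.month.year strings into real timestamps.
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y %H:%M")

# Bin by identifier and 5-minute window, then aggregate each column differently.
new_df = (
    df.groupby(["identifier", pd.Grouper(key="date", freq="5min")])
      .agg({"value_1": "mean", "value_2": "sum"})
)
print(new_df)
```

Depending on the pandas version, value_1 may print as floats (for example 15.0) because it is a mean, while value_2 stays an integer sum.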

If you want the same format as the desired output in your post, you can reset the index and sort:

new_df.reset_index().sort_values(['date','identifier'])

  identifier                date  value_1  value_2
0          A 2018-07-05 16:35:00       15        2
2          B 2018-07-05 16:35:00       25        2
1          A 2018-07-05 16:40:00       30        5
3          B 2018-07-05 16:40:00       30        7
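
If the goal is to match the question's table exactly, including the original date strings, one possible variation is to floor the timestamps to 5-minute boundaries yourself and group with as_index=False. This is only a sketch and not part of the answer above: the helper name summarize_5min, the small demo frame, and the final strftime step are my own additions. It assumes the date column has already been converted with pd.to_datetime. Flooring with dt.floor should produce the same bins as pd.Grouper with freq='5min' for data like this.

```python
import pandas as pd


def summarize_5min(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean of value_1 and sum of value_2 per 5-minute bin and identifier."""
    out = (
        df.assign(date=df["date"].dt.floor("5min"))          # snap each timestamp down to its bin start
          .groupby(["date", "identifier"], as_index=False)   # keep the keys as ordinary columns
          .agg({"value_1": "mean", "value_2": "sum"})
    )
    # Render the bin start in the question's day.month.year format.
    out["date"] = out["date"].dt.strftime("%d.%m.%Y %H:%M")
    return out


# Small demo frame with already-parsed timestamps (illustrative data only).
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["05.07.2018 16:35", "05.07.2018 16:37", "05.07.2018 16:36", "05.07.2018 16:41"],
        format="%d.%m.%Y %H:%M",
    ),
    "identifier": ["A", "A", "B", "A"],
    "value_1": [10, 20, 20, 30],
    "value_2": [0, 2, 1, 1],
})
result = summarize_5min(demo)
print(result)
```

Because groupby sorts its keys by default, the rows come out ordered by date and then identifier, so no separate sort_values call is needed here.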
