Chopping up a Pandas data frame of irregular time intervals into day boundaries

I have a data frame that looks like the following:

import pandas as pd
x = pd.DataFrame({'start_time': ['2012-01-01 23:00', '2012-01-02 02:00', '2012-01-02 05:00'], 'end_time': ['2012-01-02 02:00', '2012-01-02 05:00', '2012-01-02 09:00'], 'count': [3, 5, 1]})

'''
start_time,end_time,count
2012-01-01 23:00,2012-01-02 02:00,3
2012-01-02 02:00,2012-01-02 05:00,5
2012-01-02 05:00,2012-01-02 09:00,1
'''

For example, the first row might represent the fact that there were 3 sales between Jan 1 11p and Jan 2 2a.

These time intervals cross day boundaries, but I want to be able to get a rough estimate of how many sales there were per day. So in the example above, I want the row representing 3 sales between 11p-2a to be divided into two rows:

  1. One row from 11p-midnight, with 1 sale (because the original 3 sales covered 3 hours, and this row covers only 1 of those hours, so 1/3 * 3 = 1).
  2. Another row from midnight-2a, with 2 sales.
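
The arithmetic behind these two rows can be sketched for the first interval alone. This is only an illustration: it assumes sales are spread evenly over the interval, and the variable names (`midnight`, `before`, `after`) are made up for the example.

```python
import pandas as pd

# First row of the example: 3 sales between Jan 1 11pm and Jan 2 2am.
start = pd.Timestamp("2012-01-01 23:00")
end = pd.Timestamp("2012-01-02 02:00")
count = 3

# Midnight at the start of the end day is the boundary between the two days.
midnight = end.normalize()

# Fraction of the interval on each side of midnight
# (dividing one Timedelta by another gives a plain float).
before = (midnight - start) / (end - start)  # 1 hour out of 3
after = (end - midnight) / (end - start)     # 2 hours out of 3

sales_first_day = count * before
sales_second_day = count * after
print(sales_first_day, sales_second_day)
```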

Is there an easy way to do this?

I couldn't think of a nice way to vectorize the answer, but here's a hack that gets the basic logic across. There's surely a way to generate something cleaner than this, but maybe this is all you need.

from collections import Counter

import pandas as pd

x = pd.DataFrame({'start_time': ['2012-01-01 23:00', '2012-01-03 02:00', '2012-01-04 22:00'],
                  'end_time': ['2012-01-02 02:00', '2012-01-03 05:00', '2012-01-05 02:00'],
                  'count': [3, 5, 1]})
x['start_time'] = pd.to_datetime(x['start_time'])
x['end_time'] = pd.to_datetime(x['end_time'])

# Truncate a timestamp to midnight of the same day.
# (Timestamp.normalize() stands in for pd.datetime, which recent pandas versions no longer provide.)
def strip_time(ts):
    return ts.normalize()

c = Counter()
for _, row in x.iterrows():
    start, end, count = row['start_time'], row['end_time'], row['count']
    # Compare full dates, not just the day of the month.
    if strip_time(start) == strip_time(end):
        # Interval lies within a single day: assign the whole count to that day.
        c[strip_time(start)] += count
    else:
        # Interval crosses midnight: split the count in proportion to the time
        # on each side. This assumes the interval crosses only one midnight.
        delta_t = end - start
        midnight = strip_time(end)
        c[strip_time(start)] += count * ((midnight - start) / delta_t)
        c[midnight] += count * ((end - midnight) / delta_t)

s = pd.Series(c)

# s:
2012-01-01    1.0
2012-01-02    2.0
2012-01-03    5.0
2012-01-04    0.5
2012-01-05    0.5
dtype: float64
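
The loop above assumes each interval crosses at most one midnight. As a rough sketch of one way to relax that, the version below walks each interval from midnight to midnight and gives every piece its share of the count. The function name `split_by_day` and the sample data are hypothetical, and the sketch assumes `end_time` is later than `start_time`, timestamps carry no time zone, and sales are spread evenly over each interval.

```python
import pandas as pd

def split_by_day(df):
    """Spread each row's count over the calendar days its interval touches.

    Hypothetical helper: expects datetime columns 'start_time' and 'end_time'
    and a numeric 'count' column, with end_time later than start_time.
    """
    pieces = []
    for start, end, count in zip(df["start_time"], df["end_time"], df["count"]):
        total = end - start
        piece_start = start
        while piece_start < end:
            # Each piece ends at the next midnight or at the interval end,
            # whichever comes first.
            next_midnight = piece_start.normalize() + pd.Timedelta(days=1)
            piece_end = min(next_midnight, end)
            share = (piece_end - piece_start) / total
            pieces.append((piece_start.normalize(), count * share))
            piece_start = piece_end
    out = pd.DataFrame(pieces, columns=["day", "count"])
    return out.groupby("day")["count"].sum()

# Made-up sample: one interval crossing a single midnight,
# and one 48-hour interval touching three calendar days.
sample = pd.DataFrame({
    "start_time": pd.to_datetime(["2012-01-01 23:00", "2012-01-03 12:00"]),
    "end_time": pd.to_datetime(["2012-01-02 02:00", "2012-01-05 12:00"]),
    "count": [3, 4],
})
daily = split_by_day(sample)
print(daily)
```

This is still a row-by-row loop rather than a vectorized solution, so it trades speed for readability on large frames.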
