
Pandas - binning number of minutes in various datetime ranges

I am looking for an efficient way to process the following data in pandas.

I have a dataframe containing several hundred thousand start and end timestamps:

data_df
                      start_ts                     end_ts
0    2019-06-10 12:00:00+00:00  2019-06-10 22:30:00+00:00
1    2019-06-11 12:00:00+00:00  2019-06-11 13:30:00+00:00
2    2019-06-11 14:00:00+00:00  2019-06-11 19:00:00+00:00
3    2019-06-14 12:00:00+00:00  2019-06-14 18:30:00+00:00
4    2019-06-10 12:00:00+00:00  2019-06-10 21:30:00+00:00
5    2019-06-11 12:00:00+00:00  2019-06-11 18:30:00+00:00
...

I also have a set of labeled time bins (tp1 - tp10). There are 10 bins each day, but the times of those bins can change from day to day (e.g. tp1 might be from 00:00 to 01:30 on one day, but then 00:00 to 01:45 on another day). Each data set to be processed has 7 days, with 10 time periods per day, so the set of ranges is of size 70, and looks like:

labeled_bins_df
                   start_range                  end_range  label
0    2019-06-10 00:00:00+00:00  2019-06-10 04:30:00+00:00    tp1
1    2019-06-10 04:30:00+00:00  2019-06-10 09:45:00+00:00    tp2
2    2019-06-10 09:45:00+00:00  2019-06-10 12:30:00+00:00    tp3
...

What I would like is a table with the original data_df data, but with additional columns, tp1 through tp10, giving the number of minutes each row spends in each bin:

timed_bins
                      start_ts                     end_ts    tp1    tp2    tp3    tp4 ...
0    2019-06-10 12:00:00+00:00  2019-06-10 22:30:00+00:00      0      0     30    120 ...
1    2019-06-11 12:00:00+00:00  2019-06-11 13:30:00+00:00      0     45     45      0 ...

I am currently doing this naively, looping over my rows and searching for the bins each data row falls in, and as you can imagine, this is quite slow. Is there any pandas-fu that can be used to do this kind of binning on datetime ranges?

EDIT: A thought, which might help think in a new direction. If I were to convert all of my timestamps (both in my data and in my labeled bins) into Unix timestamps (seconds since Jan 1, 1970), then it would be a matter of binning/summing based on integer ranges rather than dates. This would yield the number of seconds in each bin; simply divide by 60, and I get my minutes in each bin. This takes away all concerns about date boundaries, etc.
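The core of that idea can be sketched in a few lines: once everything is in integer seconds, the time an event spends inside a bin is just the overlap of two integer ranges, `max(0, min(ends) - max(starts))`, divided by 60. (`overlap_minutes` is a hypothetical helper name, not part of the data above.)

```python
# Overlap of two integer ranges, in minutes. With epoch seconds, this
# is all that is needed per (event, bin) pair; date boundaries vanish.
def overlap_minutes(start_a, end_a, start_b, end_b):
    return max(0, min(end_a, end_b) - max(start_a, start_b)) / 60

# Row 0 of data_df (12:00-22:30) against the tp3 bin (09:45-12:30),
# both expressed here as seconds since midnight for brevity:
event = (12 * 3600, 22 * 3600 + 30 * 60)
tp3 = (9 * 3600 + 45 * 60, 12 * 3600 + 30 * 60)
print(overlap_minutes(*event, *tp3))  # 30.0, matching tp3 in timed_bins
```

Disjoint ranges come out as 0 thanks to the outer `max`, so no special-casing is needed for events that never touch a bin.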

EDIT 2: As requested, here is a set of simplified sample data, using three different time bins. I specifically made one of the data samples (the second row) span 2 days. Additionally, there is a result_df that shows the expected output.

import datetime

import pandas as pd

data_samples = [
    {'start_ts': '2019-06-10T12:00:00+0000', 'end_ts': '2019-06-10T22:30:00+0000'},
    {'start_ts': '2019-06-10T22:00:00+0000', 'end_ts': '2019-06-11T05:30:00+0000'},
    {'start_ts': '2019-06-10T10:00:00+0000', 'end_ts': '2019-06-10T14:15:00+0000'},
    {'start_ts': '2019-06-12T08:07:00+0000', 'end_ts': '2019-06-12T18:22:00+0000'},
    {'start_ts': '2019-06-11T14:03:00+0000', 'end_ts': '2019-06-11T15:30:00+0000'},
    {'start_ts': '2019-06-11T02:33:00+0000', 'end_ts': '2019-06-11T10:31:00+0000'}
]

data_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z')} for x in data_samples]

data_df = pd.DataFrame(data_set)[['start_ts', 'end_ts']]

time_bin_samples = [
    {'start_ts': '2019-06-10T00:00:00+0000', 'end_ts': '2019-06-10T08:15:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-10T08:15:00+0000', 'end_ts': '2019-06-10T18:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-10T18:00:00+0000', 'end_ts': '2019-06-11T00:00:00+0000', 'label': 't3'},

    {'start_ts': '2019-06-11T00:00:00+0000', 'end_ts': '2019-06-11T09:00:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-11T09:00:00+0000', 'end_ts': '2019-06-11T19:15:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-11T19:15:00+0000', 'end_ts': '2019-06-12T00:00:00+0000', 'label': 't3'},

    {'start_ts': '2019-06-12T00:00:00+0000', 'end_ts': '2019-06-12T10:30:00+0000', 'label': 't1'},
    {'start_ts': '2019-06-12T10:30:00+0000', 'end_ts': '2019-06-12T12:00:00+0000', 'label': 't2'},
    {'start_ts': '2019-06-12T12:00:00+0000', 'end_ts': '2019-06-13T00:00:00+0000', 'label': 't3'},
]

time_bin_set = [{
    'start_ts': datetime.datetime.strptime(x['start_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'end_ts': datetime.datetime.strptime(x['end_ts'], '%Y-%m-%dT%H:%M:%S%z'),
    'label': x['label']} for x in time_bin_samples
]

time_bin_df = pd.DataFrame(time_bin_set)[['start_ts', 'end_ts', 'label']]

result_set = [
    {'t1': 0, 't2': 360, 't3': 270},
    {'t1': 330, 't2': 0, 't3': 120},
    {'t1': 0, 't2': 255, 't3': 0},
    {'t1': 143, 't2': 90, 't3': 382},
    {'t1': 0, 't2': 87, 't3': 0},
    {'t1': 387, 't2': 91, 't3': 0}
]

result_df = pd.DataFrame(result_set)

I know that iterating the rows of a dataframe is not efficient.

Here I would try to identify the first and last bin per row of data_df by using merge_asof.

Then I would build a list of sub-dataframes by iterating once over the dataframe values, adding all the bins corresponding to each row, and would concat that list.

From there it is enough to compute the time interval per bin and use pivot_table to get the expected result.

Code could be:

# store the index as a column to make sure to keep it
data_df = data_df.rename_axis('ix').reset_index().sort_values(
    ['end_ts', 'start_ts'])
time_bin_df = time_bin_df.rename_axis('ix').reset_index().sort_values(
    ['end_ts', 'start_ts'])

# identify first and last bin per row
first = pd.merge_asof(data_df, time_bin_df, left_on='start_ts',
                      right_on='end_ts', suffixes=('', '_first'),
                      direction='forward').values
last = pd.merge_asof(data_df, time_bin_df, left_on='end_ts', right_on='start_ts',
                     suffixes=('', '_last')).values

# build a list of bin dataframes (one per row in data_df)
data = []
for i, val in enumerate(first):
    elt = time_bin_df[(time_bin_df['ix']>=val[3])
                      &(time_bin_df['ix']<=last[i][3])].copy()
    # compute the begin and end of the intersection of the period and the bin
    elt.loc[elt['start_ts']<val[1], 'start_ts'] = val[1]
    elt.loc[elt['end_ts']>val[2], 'end_ts'] = val[2]
    elt['ix_data'] = val[0]
    data.append(elt)

# concat everything
tmp = pd.concat(data)

# compute durations in minutes
tmp['duration'] = (tmp['end_ts'] - tmp['start_ts']).dt.total_seconds() / 60

# pivot to get the expected result
result_df = tmp.pivot_table('duration', 'ix_data', 'label', 'sum', fill_value=0
                            ).rename_axis(None).rename_axis(None, axis=1)

It can take some time because there is still one lengthy operation to build the list of dataframes, but the other operations should be vectorized.

