
Python / Pandas Binning Data Timedelta


I have a DataFrame with two columns:

    userID     duration
0   DSm7ysk    03:08:49
1   no51CdJ    00:35:50
2   ...

where 'duration' has type timedelta. I have tried using:

import datetime as dt
import pandas as pd

bins = [dt.timedelta(minutes=0), dt.timedelta(minutes=5),
        dt.timedelta(minutes=10), dt.timedelta(minutes=20),
        dt.timedelta(minutes=30), dt.timedelta(hours=4)]

labels = ['0-5min','5-10min','10-20min','20-30min','30min+']

df['bins'] = pd.cut(df['duration'], bins, labels = labels)

However, the binned data doesn't use the specified bins; instead, a separate bin is created for each duration in the frame.

What is the simplest way to bin timedelta objects into irregular bins? Or am I just missing something obvious here?

It works for me with pandas 0.23.4:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'userID': ['DSm7ysk', 'no51CdJ', 'foo', 'bar'],
    'duration': [pd.Timedelta('3 hours 8 minutes 49 seconds'), pd.Timedelta('35 minutes 50 seconds'), pd.Timedelta('1 minutes 13 seconds'), pd.Timedelta('6 minutes 43 seconds')]
})

bins = [
    pd.Timedelta(minutes = 0),
    pd.Timedelta(minutes = 5),
    pd.Timedelta(minutes = 10),
    pd.Timedelta(minutes = 20),
    pd.Timedelta(minutes = 30),
    pd.Timedelta(hours = 4)
]

labels = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']

df['bins'] = pd.cut(df['duration'], bins, labels = labels)

Result: the four rows are assigned to '30min+', '30min+', '0-5min' and '5-10min' respectively.
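If the same code produces one bin per row on your own data, one thing worth checking is whether the column really holds timedelta64 values. The following is a minimal sketch under the assumption (not confirmed in the question) that the durations were stored as 'HH:MM:SS' strings. It converts them with pd.to_timedelta before calling pd.cut. The sample rows are the ones used in the answer above.

```python
import pandas as pd

# Assumption for illustration: durations arrive as 'HH:MM:SS' strings (object dtype).
df = pd.DataFrame({
    'userID': ['DSm7ysk', 'no51CdJ', 'foo', 'bar'],
    'duration': ['03:08:49', '00:35:50', '00:01:13', '00:06:43'],
})

# Convert to a real timedelta column before binning.
df['duration'] = pd.to_timedelta(df['duration'])
is_timedelta = pd.api.types.is_timedelta64_dtype(df['duration'])
print(is_timedelta)

# Same irregular bin edges as above: 0, 5, 10, 20, 30 minutes and 4 hours.
bins = [pd.Timedelta(minutes=m) for m in (0, 5, 10, 20, 30, 240)]
labels = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']

df['bins'] = pd.cut(df['duration'], bins, labels=labels)
print(df)
```

If the dtype check already prints True on your original column, the cause is probably elsewhere (for example the pandas version), and this conversion will not change anything.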

You can normalize to seconds before binning. This reduces the problem to binning plain numbers.

df = pd.DataFrame({'userID': ['A', 'B'],
                   'duration': pd.to_timedelta(['00:08:49', '00:35:50'])})

L = ['00:00:00', '00:05:00', '00:10:00', '00:20:00', '00:30:00', '04:00:00']

bins = pd.to_timedelta(L).total_seconds()
cats = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']

df['bins'] = pd.cut(df['duration'].dt.total_seconds(), bins, labels=cats)

print(df)

#    duration userID     bins
# 0  00:08:49      A  5-10min
# 1  00:35:50      B   30min+
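Working in seconds also makes it easy to leave the last bin open-ended. The sketch below is a variation on the code above, not part of the original answer. It assumes you want durations longer than 4 hours to fall into '30min+' as well, and a duration of exactly zero to land in '0-5min', so it uses np.inf as the last edge and right=False for left-closed intervals. The sample durations are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data, including the edge cases 0 s and more than 4 h.
df = pd.DataFrame({
    'userID': ['A', 'B', 'C', 'D'],
    'duration': pd.to_timedelta(['00:00:00', '00:05:00', '00:35:50', '06:00:00']),
})

# Bin edges in seconds; np.inf leaves the last bin unbounded.
bins = [0, 5 * 60, 10 * 60, 20 * 60, 30 * 60, np.inf]
cats = ['0-5min', '5-10min', '10-20min', '20-30min', '30min+']

# right=False makes each bin left-closed: [0, 300), [300, 600), ...
df['bins'] = pd.cut(df['duration'].dt.total_seconds(), bins, labels=cats, right=False)
print(df)
```

With the default right=True, the intervals are right-closed instead, so a duration of exactly zero would get no bin (NaN) and a duration of exactly 5 minutes would fall into '0-5min'.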
