
Grouping rows by time-range in Pandas dataframe

I have a large dataframe indexed by timestamps in which I would like to assign the rows to groups according to a time range.

In the following data, for example, I have grouped the rows within 1 ms of the first entry in the group.

                           groupid
1999-12-31 23:59:59.000107       1
1999-12-31 23:59:59.000385       1
1999-12-31 23:59:59.000404       1
1999-12-31 23:59:59.000704       1
1999-12-31 23:59:59.001281       2
1999-12-31 23:59:59.002211       2
1999-12-31 23:59:59.002367       3

I have working code which does this by iterating over the rows and using the current row to slice the dataframe:

from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# 1000 random timestamps within the same second
dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)])
df = pd.DataFrame({'groupid': None}, dts)

print(df.head(20))

groupid = 1
for dt, row in df.iterrows():
    if df.loc[row.name].groupid:  # row already assigned to a group
        continue
    # slice out every row within 1 ms of this group's first entry
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1

print(df.head(20))

However, as is common with iterrows, the operation is slow on large dataframes. I've made various attempts at applying a function and using groupby, but without success. Is using itertuples the best I can do for a performance boost (which I'm going to try now)? Could someone give some advice please?
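
One way to avoid per-row DataFrame indexing, while keeping the exact "1 ms from the first entry of each group" rule, is a single pass over the raw integer timestamps. This is a sketch of my own (not from the original post), assuming nanosecond-resolution timestamps:

```python
import numpy as np
import pandas as pd
from datetime import datetime

np.random.seed(0)
dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame(index=pd.DatetimeIndex(dts))

# Single pass over the raw int64 nanosecond timestamps: start a new group
# whenever a row is more than 1 ms past the first row of the current group.
ts = df.index.values.astype('int64')
groupids = np.empty(len(ts), dtype=np.int64)
gid, start = 1, ts[0]
for i, t in enumerate(ts):
    if t - start > 1_000_000:  # 1 ms in nanoseconds
        gid += 1
        start = t
    groupids[i] = gid
df['groupid'] = groupids
```

The loop still iterates, but over a plain NumPy array rather than through `.loc` lookups and slices on each row, which is typically much cheaper.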

OK, I think the following is what you want. This constructs a timedelta from your index by subtracting the first value from all values. We then access the microseconds component, divide by 1000, and cast the Series dtype to int:

In [86]:

df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
                            groupid
1999-12-31 23:59:59.000133        0
1999-12-31 23:59:59.000584        0
1999-12-31 23:59:59.003544        3
1999-12-31 23:59:59.009193        9
1999-12-31 23:59:59.010220       10
1999-12-31 23:59:59.010632       10
1999-12-31 23:59:59.010716       10
1999-12-31 23:59:59.011387       11
1999-12-31 23:59:59.011837       11
1999-12-31 23:59:59.013277       13
1999-12-31 23:59:59.013305       13
1999-12-31 23:59:59.014754       14
1999-12-31 23:59:59.016015       15
1999-12-31 23:59:59.016067       15
1999-12-31 23:59:59.017788       17
1999-12-31 23:59:59.018236       18
1999-12-31 23:59:59.021281       21
1999-12-31 23:59:59.021772       21
1999-12-31 23:59:59.021927       21
1999-12-31 23:59:59.022200       22
1999-12-31 23:59:59.023104       22
1999-12-31 23:59:59.023375       23
1999-12-31 23:59:59.023688       23
1999-12-31 23:59:59.023726       23
1999-12-31 23:59:59.025397       25
1999-12-31 23:59:59.026407       26
1999-12-31 23:59:59.026480       26
1999-12-31 23:59:59.027825       27
1999-12-31 23:59:59.028793       28
1999-12-31 23:59:59.030716       30
...                             ...
1999-12-31 23:59:59.975432      975
1999-12-31 23:59:59.976699      976
1999-12-31 23:59:59.977177      977
1999-12-31 23:59:59.979475      979
1999-12-31 23:59:59.980282      980
1999-12-31 23:59:59.980672      980
1999-12-31 23:59:59.983202      983
1999-12-31 23:59:59.984214      984
1999-12-31 23:59:59.984674      984
1999-12-31 23:59:59.984933      984
1999-12-31 23:59:59.985664      985
1999-12-31 23:59:59.985779      985
1999-12-31 23:59:59.988812      988
1999-12-31 23:59:59.989324      989
1999-12-31 23:59:59.990386      990
1999-12-31 23:59:59.990485      990
1999-12-31 23:59:59.990969      990
1999-12-31 23:59:59.991255      991
1999-12-31 23:59:59.991739      991
1999-12-31 23:59:59.993979      993
1999-12-31 23:59:59.994705      994
1999-12-31 23:59:59.994874      994
1999-12-31 23:59:59.995397      995
1999-12-31 23:59:59.995753      995
1999-12-31 23:59:59.995863      995
1999-12-31 23:59:59.996574      996
1999-12-31 23:59:59.998139      998
1999-12-31 23:59:59.998533      998
1999-12-31 23:59:59.998778      998
1999-12-31 23:59:59.999915      999

Thanks to @Jeff for pointing out the much cleaner method:

In [96]:
df['groupid'] = (df.index-df.index[0]).astype('timedelta64[ms]')
df

Out[96]:
                            groupid
1999-12-31 23:59:59.000884        0
1999-12-31 23:59:59.001175        0
1999-12-31 23:59:59.001262        0
1999-12-31 23:59:59.001540        0
1999-12-31 23:59:59.001769        0
1999-12-31 23:59:59.002478        1
1999-12-31 23:59:59.005001        4
1999-12-31 23:59:59.005497        4
1999-12-31 23:59:59.006908        6
1999-12-31 23:59:59.008860        7
1999-12-31 23:59:59.009257        8
1999-12-31 23:59:59.010012        9
1999-12-31 23:59:59.011451       10
1999-12-31 23:59:59.013177       12
1999-12-31 23:59:59.014138       13
1999-12-31 23:59:59.015795       14
1999-12-31 23:59:59.015865       14
1999-12-31 23:59:59.016069       15
1999-12-31 23:59:59.016666       15
1999-12-31 23:59:59.016718       15
1999-12-31 23:59:59.019058       18
1999-12-31 23:59:59.019675       18
1999-12-31 23:59:59.020747       19
1999-12-31 23:59:59.021856       20
1999-12-31 23:59:59.022959       22
1999-12-31 23:59:59.023812       22
1999-12-31 23:59:59.023938       23
1999-12-31 23:59:59.024122       23
1999-12-31 23:59:59.025332       24
1999-12-31 23:59:59.025397       24
...                             ...
1999-12-31 23:59:59.959725      958
1999-12-31 23:59:59.959742      958
1999-12-31 23:59:59.959892      959
1999-12-31 23:59:59.960345      959
1999-12-31 23:59:59.960800      959
1999-12-31 23:59:59.961054      960
1999-12-31 23:59:59.962749      961
1999-12-31 23:59:59.965681      964
1999-12-31 23:59:59.966409      965
1999-12-31 23:59:59.966558      965
1999-12-31 23:59:59.967357      966
1999-12-31 23:59:59.967842      966
1999-12-31 23:59:59.970465      969
1999-12-31 23:59:59.974022      973
1999-12-31 23:59:59.974734      973
1999-12-31 23:59:59.975879      974
1999-12-31 23:59:59.978291      977
1999-12-31 23:59:59.980483      979
1999-12-31 23:59:59.980868      979
1999-12-31 23:59:59.981417      980
1999-12-31 23:59:59.984208      983
1999-12-31 23:59:59.984639      983
1999-12-31 23:59:59.985533      984
1999-12-31 23:59:59.986785      985
1999-12-31 23:59:59.987502      986
1999-12-31 23:59:59.987914      987
1999-12-31 23:59:59.988406      987
1999-12-31 23:59:59.989436      988
1999-12-31 23:59:59.994449      993
1999-12-31 23:59:59.996657      995
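
As a side note, in recent pandas versions casting a TimedeltaIndex with `.astype('timedelta64[ms]')` no longer yields integers, so a more version-stable variant (my assumption, not from the original answer) is integer floor division by a 1 ms Timedelta; the resulting ids can then drive ordinary groupby aggregation:

```python
import numpy as np
import pandas as pd
from datetime import datetime

np.random.seed(0)
dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame({'value': np.random.randn(len(dts))},
                  index=pd.DatetimeIndex(dts))

# Floor division by a 1 ms Timedelta gives the integer bin number directly.
df['groupid'] = (df.index - df.index[0]) // pd.Timedelta(milliseconds=1)

# The assigned ids can then be aggregated like any other column.
counts = df.groupby('groupid')['value'].count()
```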

This is like a resample operation.

Create your data

In [39]: pd.set_option('max_rows',12)

In [40]: np.random.seed(11111)

In [41]: dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)])

In [42]: df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)

So simply grouping gives you the groups directly. You can iterate over this, as it is a generator.

In [43]: list(df.groupby(pd.Grouper(freq='ms')))[0:3]
Out[43]: 
[(Timestamp('1999-12-31 23:59:59', offset='L'),
                               groupid
  1999-12-31 23:59:59.000789 -1.369503
  1999-12-31 23:59:59.000814  0.776049),
 (Timestamp('1999-12-31 23:59:59.001000', offset='L'),
                               groupid
  1999-12-31 23:59:59.001041 -0.374915
  1999-12-31 23:59:59.001062 -1.470845),
 (Timestamp('1999-12-31 23:59:59.002000', offset='L'),
                               groupid
  1999-12-31 23:59:59.002355 -0.240954)]

It might be simpler just to resample. You can use a custom function for the how argument.

In [44]: df.resample('ms',how='sum')
Out[44]: 
                          groupid
1999-12-31 23:59:59.000 -0.593454
1999-12-31 23:59:59.001 -1.845759
1999-12-31 23:59:59.002 -0.240954
1999-12-31 23:59:59.003  1.291403
1999-12-31 23:59:59.004       NaN
1999-12-31 23:59:59.005  0.291484
...                           ...
1999-12-31 23:59:59.994       NaN
1999-12-31 23:59:59.995       NaN
1999-12-31 23:59:59.996       NaN
1999-12-31 23:59:59.997 -0.445052
1999-12-31 23:59:59.998       NaN
1999-12-31 23:59:59.999 -0.895305

[1000 rows x 1 columns]
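
In later pandas releases the `how` keyword was removed in favour of method chaining; a hedged modern equivalent of the session above would look like this (`min_count=1` makes empty bins come out as NaN, matching the output above, since `sum()` otherwise returns 0 for them):

```python
import numpy as np
import pandas as pd
from datetime import datetime

np.random.seed(11111)
dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame({'groupid': np.random.randn(len(dts))},
                  index=pd.DatetimeIndex(dts))

# Modern spelling of df.resample('ms', how='sum').
out = df.resample('ms').sum(min_count=1)
```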
