Grouping rows by time-range in Pandas dataframe
I have a large dataframe indexed by timestamps in which I would like to assign the rows to groups according to a time range.
In the following data, for example, rows are grouped when they fall within 1 ms of the first entry in the group.
groupid
1999-12-31 23:59:59.000107 1
1999-12-31 23:59:59.000385 1
1999-12-31 23:59:59.000404 1
1999-12-31 23:59:59.000704 1
1999-12-31 23:59:59.001281 2
1999-12-31 23:59:59.002211 2
1999-12-31 23:59:59.002367 3
I have working code that does this by iterating over the rows, using the current row to slice the dataframe:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame({'groupid': None}, index=dts)
print(df.head(20))

groupid = 1
for dt, row in df.iterrows():
    if df.loc[row.name].groupid:  # row already assigned to a group
        continue
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]  # every row within 1 ms of this one
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1
print(df.head(20))
However, as is common with iterrows, the operation is slow on large dataframes. I've made various attempts at applying a function and using groupby, but without success. Is switching to itertuples the best I can do for a performance boost (which I'm going to try now)? Could someone give some advice, please?
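For reference, the greedy loop in the question can also be expressed without iterrows by stepping a window through the sorted index with searchsorted. A sketch (the helper name is mine), assuming a sorted index and, like `df.loc[dt:end]`, an inclusive 1 ms window:

```python
import numpy as np
import pandas as pd

def greedy_groups(index, width=pd.Timedelta(milliseconds=1)):
    """Greedily assign group ids: each group spans `width` from its first row."""
    ids = np.empty(len(index), dtype=np.int64)
    gid, start = 1, 0
    while start < len(index):
        # First position strictly past the inclusive window
        # [index[start], index[start] + width]
        end = index.searchsorted(index[start] + width, side='right')
        ids[start:end] = gid
        gid += 1
        start = end
    return ids
```

On the sample data above this reproduces groups 1, 1, 1, 1, 2, 2, 3; note this anchors each window at the group's first row, which is not the same as binning on fixed 1 ms boundaries.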
OK, I think the following is what you want. This constructs a timedelta Series from your index by subtracting the first value from all values. We then access the microseconds component, divide by 1000, and cast the Series dtype to int:
In [86]:
df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
groupid
1999-12-31 23:59:59.000133 0
1999-12-31 23:59:59.000584 0
1999-12-31 23:59:59.003544 3
1999-12-31 23:59:59.009193 9
1999-12-31 23:59:59.010220 10
1999-12-31 23:59:59.010632 10
1999-12-31 23:59:59.010716 10
1999-12-31 23:59:59.011387 11
1999-12-31 23:59:59.011837 11
1999-12-31 23:59:59.013277 13
1999-12-31 23:59:59.013305 13
1999-12-31 23:59:59.014754 14
1999-12-31 23:59:59.016015 15
1999-12-31 23:59:59.016067 15
1999-12-31 23:59:59.017788 17
1999-12-31 23:59:59.018236 18
1999-12-31 23:59:59.021281 21
1999-12-31 23:59:59.021772 21
1999-12-31 23:59:59.021927 21
1999-12-31 23:59:59.022200 22
1999-12-31 23:59:59.023104 22
1999-12-31 23:59:59.023375 23
1999-12-31 23:59:59.023688 23
1999-12-31 23:59:59.023726 23
1999-12-31 23:59:59.025397 25
1999-12-31 23:59:59.026407 26
1999-12-31 23:59:59.026480 26
1999-12-31 23:59:59.027825 27
1999-12-31 23:59:59.028793 28
1999-12-31 23:59:59.030716 30
... ...
1999-12-31 23:59:59.975432 975
1999-12-31 23:59:59.976699 976
1999-12-31 23:59:59.977177 977
1999-12-31 23:59:59.979475 979
1999-12-31 23:59:59.980282 980
1999-12-31 23:59:59.980672 980
1999-12-31 23:59:59.983202 983
1999-12-31 23:59:59.984214 984
1999-12-31 23:59:59.984674 984
1999-12-31 23:59:59.984933 984
1999-12-31 23:59:59.985664 985
1999-12-31 23:59:59.985779 985
1999-12-31 23:59:59.988812 988
1999-12-31 23:59:59.989324 989
1999-12-31 23:59:59.990386 990
1999-12-31 23:59:59.990485 990
1999-12-31 23:59:59.990969 990
1999-12-31 23:59:59.991255 991
1999-12-31 23:59:59.991739 991
1999-12-31 23:59:59.993979 993
1999-12-31 23:59:59.994705 994
1999-12-31 23:59:59.994874 994
1999-12-31 23:59:59.995397 995
1999-12-31 23:59:59.995753 995
1999-12-31 23:59:59.995863 995
1999-12-31 23:59:59.996574 996
1999-12-31 23:59:59.998139 998
1999-12-31 23:59:59.998533 998
1999-12-31 23:59:59.998778 998
1999-12-31 23:59:59.999915 999
Thanks to @Jeff for pointing out the much cleaner method:
In [96]:
df['groupid'] = (df.index-df.index[0]).astype('timedelta64[ms]')
df
Out[96]:
groupid
1999-12-31 23:59:59.000884 0
1999-12-31 23:59:59.001175 0
1999-12-31 23:59:59.001262 0
1999-12-31 23:59:59.001540 0
1999-12-31 23:59:59.001769 0
1999-12-31 23:59:59.002478 1
1999-12-31 23:59:59.005001 4
1999-12-31 23:59:59.005497 4
1999-12-31 23:59:59.006908 6
1999-12-31 23:59:59.008860 7
1999-12-31 23:59:59.009257 8
1999-12-31 23:59:59.010012 9
1999-12-31 23:59:59.011451 10
1999-12-31 23:59:59.013177 12
1999-12-31 23:59:59.014138 13
1999-12-31 23:59:59.015795 14
1999-12-31 23:59:59.015865 14
1999-12-31 23:59:59.016069 15
1999-12-31 23:59:59.016666 15
1999-12-31 23:59:59.016718 15
1999-12-31 23:59:59.019058 18
1999-12-31 23:59:59.019675 18
1999-12-31 23:59:59.020747 19
1999-12-31 23:59:59.021856 20
1999-12-31 23:59:59.022959 22
1999-12-31 23:59:59.023812 22
1999-12-31 23:59:59.023938 23
1999-12-31 23:59:59.024122 23
1999-12-31 23:59:59.025332 24
1999-12-31 23:59:59.025397 24
... ...
1999-12-31 23:59:59.959725 958
1999-12-31 23:59:59.959742 958
1999-12-31 23:59:59.959892 959
1999-12-31 23:59:59.960345 959
1999-12-31 23:59:59.960800 959
1999-12-31 23:59:59.961054 960
1999-12-31 23:59:59.962749 961
1999-12-31 23:59:59.965681 964
1999-12-31 23:59:59.966409 965
1999-12-31 23:59:59.966558 965
1999-12-31 23:59:59.967357 966
1999-12-31 23:59:59.967842 966
1999-12-31 23:59:59.970465 969
1999-12-31 23:59:59.974022 973
1999-12-31 23:59:59.974734 973
1999-12-31 23:59:59.975879 974
1999-12-31 23:59:59.978291 977
1999-12-31 23:59:59.980483 979
1999-12-31 23:59:59.980868 979
1999-12-31 23:59:59.981417 980
1999-12-31 23:59:59.984208 983
1999-12-31 23:59:59.984639 983
1999-12-31 23:59:59.985533 984
1999-12-31 23:59:59.986785 985
1999-12-31 23:59:59.987502 986
1999-12-31 23:59:59.987914 987
1999-12-31 23:59:59.988406 987
1999-12-31 23:59:59.989436 988
1999-12-31 23:59:59.994449 993
1999-12-31 23:59:59.996657 995
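On recent pandas, where that astype cast keeps a timedelta dtype rather than producing integers, the same 1 ms bucketing can be written as an integer division by a Timedelta. A sketch of the equivalent spelling (the seed here is arbitrary):

```python
from datetime import datetime

import numpy as np
import pandas as pd

np.random.seed(0)
dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame(index=pd.DatetimeIndex(dts))

# Whole milliseconds elapsed since the first timestamp.
df['groupid'] = (df.index - df.index[0]) // pd.Timedelta(milliseconds=1)
```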
This is like a resample operation.
Create your data:
In [39]: pd.set_option('max_rows',12)
In [40]: np.random.seed(11111)
In [41]: dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
x in np.random.randint(1, 999999, 1000)])
In [42]: df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)
So simply grouping gives you the groups directly. You can iterate over the result, as this is a generator.
In [43]: list(df.groupby(pd.Grouper(freq='ms')))[0:3]
Out[43]:
[(Timestamp('1999-12-31 23:59:59', offset='L'),
groupid
1999-12-31 23:59:59.000789 -1.369503
1999-12-31 23:59:59.000814 0.776049),
(Timestamp('1999-12-31 23:59:59.001000', offset='L'),
groupid
1999-12-31 23:59:59.001041 -0.374915
1999-12-31 23:59:59.001062 -1.470845),
(Timestamp('1999-12-31 23:59:59.002000', offset='L'),
groupid
1999-12-31 23:59:59.002355 -0.240954)]
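If you want the bucket id back on each row, as in the question, the same Grouper can feed GroupBy.ngroup, which numbers the non-empty 1 ms buckets consecutively. A sketch on the same data (the column name val is mine):

```python
from datetime import datetime

import numpy as np
import pandas as pd

np.random.seed(11111)
dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame({'val': np.random.randn(len(dts))}, index=dts)

# Consecutive id per non-empty 1 ms bucket, aligned to the original rows.
df['groupid'] = df.groupby(pd.Grouper(freq='ms')).ngroup()
```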
Might be simpler just to resample. You can use a custom function for `how`.
In [44]: df.resample('ms',how='sum')
Out[44]:
groupid
1999-12-31 23:59:59.000 -0.593454
1999-12-31 23:59:59.001 -1.845759
1999-12-31 23:59:59.002 -0.240954
1999-12-31 23:59:59.003 1.291403
1999-12-31 23:59:59.004 NaN
1999-12-31 23:59:59.005 0.291484
... ...
1999-12-31 23:59:59.994 NaN
1999-12-31 23:59:59.995 NaN
1999-12-31 23:59:59.996 NaN
1999-12-31 23:59:59.997 -0.445052
1999-12-31 23:59:59.998 NaN
1999-12-31 23:59:59.999 -0.895305
[1000 rows x 1 columns]
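On recent pandas the `how` argument to resample has been removed; the equivalent is to call the aggregation on the resampler. A sketch of the modern spelling on the same data:

```python
from datetime import datetime

import numpy as np
import pandas as pd

np.random.seed(11111)
dts = sorted(datetime(1999, 12, 31, 23, 59, 59, x)
             for x in np.random.randint(1, 999999, 1000))
df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, index=dts)

# Equivalent of df.resample('ms', how='sum') on older pandas.
per_ms = df.resample('ms').sum()
```

One difference: a modern `.sum()` returns 0.0 for empty bins rather than the NaN shown above; pass `min_count=1` to restore the NaN behaviour.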