Python re-sampling time series data which can not be indexed
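The goal of this question is to find out how many trades happened in each second (count) and the total traded volume (sum).
I have time series data which can not be indexed (there are multiple entries with the same timestamp, since many trades can arrive within the same millisecond), so I can not use the resampling explained here.
Another way would be to first group by time as shown here (and later re-sample per second). The problem is that a group-by applies only one basic arithmetic operation to the grouped items (I can only do sum/mean/std etc.), whereas in this data I need the "tradeVolume" column to be aggregated by sum and the "ask1" column to be aggregated by mean.
So my questions are:
1. How can I do a group by with a different aggregation for each column?
2. If that is not possible, is there any other way to re-sample millisecond data to seconds without a datetime index?
Thanks!
A sample of the time series is here: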
SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:09.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:09.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:09.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:09.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:09.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:09.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0
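First you need a column of seconds (since the epoch), then groupby using that column, then aggregate the columns you want.
You want to floor the timestamp to one-second precision and group on that. Then apply an aggregation to get the mean/sum/std or whatever you need.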
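import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')
# Truncate the timestamps to whole seconds so that rows within the same second share one key
df['dateTime'] = df['dateTime'].astype('datetime64[s]')
groups = df.groupby('dateTime')
# The dict maps each column to its own aggregation function
groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})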
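Since the question asks for both the number of trades and the total volume per second, the same idea can be sketched with named aggregation. This is only a minimal sketch, assuming that `isTrade` marks actual trades with 1 and that the installed pandas supports named aggregation (0.25 or later). The inline sample and the names `second`, `ask1_mean`, `volume_sum` and `trade_count` are illustrative and do not come from the original answer.

```python
import io

import pandas as pd

# Illustrative subset of the columns, inlined so the sketch is self-contained.
csv_text = """dateTime,ask1,tradeVolume,isTrade
2017-11-20 08:00:09.052240,12869.0,0,0
2017-11-20 08:00:09.052260,12869.0,1,1
2017-11-20 08:00:09.052260,12869.0,1,0
2017-11-20 08:00:10.052315,12870.0,1,1
2017-11-20 08:00:10.052315,12871.0,2,1
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["dateTime"])

# Floor every timestamp to whole seconds; duplicate keys are fine for groupby.
df["second"] = df["dateTime"].dt.floor("s")

# One output column per (input column, aggregation) pair.
per_second = df.groupby("second").agg(
    ask1_mean=("ask1", "mean"),
    volume_sum=("tradeVolume", "sum"),
    trade_count=("isTrade", "sum"),  # assumes isTrade is 0/1
)
print(per_second)
```

Summing the 0/1 flag gives the trade count without filtering the frame first; if `isTrade` were encoded differently, a boolean comparison would be needed before summing.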
I modified the data to make sure it actually contains different seconds,
SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:10.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:10.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0
and the output
In [53]: groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
Out[53]:
               ask1  tradeVolume
seconds
1511164809  12869.0           10
1511164810  12869.0           10
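The `seconds` index in the output above appears to come from the seconds-since-epoch variant described at the start of the answer. A minimal sketch of that variant follows, assuming the timestamps are timezone-naive and can be treated as UTC. The inline sample is an illustrative subset of the data.

```python
import io

import pandas as pd

# Illustrative subset of the columns, inlined so the sketch is self-contained.
csv_text = """dateTime,ask1,tradeVolume
2017-11-20 08:00:09.052240,12869.0,0
2017-11-20 08:00:09.052260,12869.0,1
2017-11-20 08:00:10.052315,12869.0,1
2017-11-20 08:00:10.052335,12869.0,1
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["dateTime"])

# Whole seconds since 1970-01-01, computed from a timedelta so the result
# does not depend on the resolution the datetimes are stored in.
epoch = pd.Timestamp("1970-01-01")
df["seconds"] = (df["dateTime"] - epoch) // pd.Timedelta("1s")

result = df.groupby("seconds").agg({"ask1": "mean", "tradeVolume": "sum"})
print(result)
```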
Footnote
The OP said that the original version (below) was faster, so I ran some timings
import numpy as np
import pandas as pd


def test1(df):
    """This is the fastest and cleanest."""
    df['dateTime'] = df['dateTime'].astype('datetime64[s]')
    groups = df.groupby('dateTime')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})


def test2(df):
    """Totally unnecessary amount of datetime floors."""
    def group_by_second(index_loc):
        # Called once per index label, which is what makes this variant slow
        return df.loc[index_loc, 'dateTime'].floor('S')
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    groups = df.groupby(group_by_second)
    result = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})


def test3(df):
    """Original version, but the conversion to/from nanoseconds is unnecessary."""
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    # Timestamp.value is nanoseconds since the epoch
    df['seconds'] = df['dateTime'].apply(lambda v: v.value // 1e9)
    groups = df.groupby('seconds')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})


if __name__ == '__main__':
    import timeit

    print('22 rows')
    df = pd.read_csv('data_small.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))

    print('220 rows')
    df = pd.read_csv('data.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))
I tested on two datasets, the second being 10 times the size of the first
22 rows
test1 [0.08138518501073122, 0.07786444900557399, 0.0775048139039427]
test2 [0.2644687460269779, 0.26298125297762454, 0.2618108610622585]
test3 [0.10624988097697496, 0.1028324980288744, 0.10304366517812014]
220 rows
test1 [0.07999306707642972, 0.07842653687112033, 0.07848454895429313]
test2 [1.9794962559826672, 1.966513831866905, 1.9625889619346708]
test3 [0.12691736104898155, 0.12642419710755348, 0.126510804053396]
So it is best to use the .astype('datetime64[s]') version, as it is the fastest and scales the best.
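On the premise of the question: as far as I know, duplicate timestamps do not by themselves prevent `resample`, because a `DatetimeIndex` does not have to be unique. A hedged sketch of that alternative follows, using an illustrative inline sample rather than the original data. Unlike `groupby`, `resample` also emits a row for every second that has no data.

```python
import io

import pandas as pd

# Illustrative sample: a duplicated timestamp and a one-second gap (08:00:10).
csv_text = """dateTime,ask1,tradeVolume
2017-11-20 08:00:09.052240,12869.0,1
2017-11-20 08:00:09.052240,12869.0,1
2017-11-20 08:00:11.052315,12870.0,1
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["dateTime"])

# A non-unique DatetimeIndex is accepted by resample.
resampled = (
    df.set_index("dateTime")
      .resample("1s")
      .agg({"ask1": "mean", "tradeVolume": "sum"})
)
print(resampled)
```

For the empty second the summed volume is 0 and the mean is NaN, so whether this or the `groupby` form is preferable depends on whether gaps should appear in the result.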