Remove lags/gaps in time series data for a particular dataframe in pandas

I am trying to model a distribution for a particular stock in terms of the number of times the order book is updated within a given timeframe.

The issue I am having relates to data engineering with pandas. Because I only work with trading hours and trading days, my dataset has multiple gaps, so the data does not appear continuous. The graph below shows this clearly:

[image]

You can see that the large gaps are weekends and the small gaps are the hours outside the trading session. The small black squares, if zoomed in, would look something like this: [image]

The dataframe looks like this:

    arrivalTime           value           date
    0 days 09:30:02.231     1          2021-05-03
    0 days 09:30:02.981     3          2021-05-03
    0 days 09:30:02.999     99         2021-05-03
    0 days 09:30:10.284     11         2021-05-03
    0 days 09:30:10.293     92         2021-05-03
    ...                     ...        ...
    0 days 15:59:42.654     82         2021-05-28
    0 days 15:59:42.655     19         2021-05-28
    0 days 15:59:42.651     122        2021-05-28
    0 days 15:59:42.941     199        2021-05-28
    0 days 15:59:44.721     19         2021-05-28

What I need is for each trading day to continue exactly where the previous trading day ended, with no gap in between. Let me know if you have any questions.

Thanks!

IIUC, you want to shift the dates so that they become consecutive:

Sample:

>>> df
         date
0  2021-05-05
1  2021-05-05
2  2021-05-05
3  2021-05-06
4  2021-05-06
5  2021-05-06
6  2021-05-07
7  2021-05-07
8  2021-05-07
9  2021-05-10  # <- 2021-05-08
10 2021-05-10  # <- 2021-05-08
11 2021-05-10  # <- 2021-05-08
12 2021-05-11  # <- 2021-05-09
13 2021-05-11  # <- 2021-05-09
14 2021-05-11  # <- 2021-05-09
>>> df['date'].min() + df['date'].diff().ne(pd.Timedelta(0)).cumsum().sub(1) \
                                 .apply(pd.tseries.offsets.Day)
0    2021-05-05
1    2021-05-05
2    2021-05-05
3    2021-05-06
4    2021-05-06
5    2021-05-06
6    2021-05-07
7    2021-05-07
8    2021-05-07
9    2021-05-08
10   2021-05-08
11   2021-05-08
12   2021-05-09
13   2021-05-09
14   2021-05-09
Name: date, dtype: datetime64[ns]
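
The expression above only closes the gaps between calendar dates. If the goal is also to remove the overnight gap, so that each session starts right where the previous one ended, one possible approach is to number the trading days and build an elapsed-trading-time axis from that number and `arrivalTime`. The following is a minimal sketch, not a definitive implementation: the sample rows are made up, the names `day_index`, `compressed_date` and `elapsed` are hypothetical, and the 09:30 to 16:00 session is an assumption read off the timestamps in the question.

```python
import pandas as pd

# Made-up sample with the same columns as the question's dataframe.
df = pd.DataFrame({
    "arrivalTime": pd.to_timedelta([
        "09:30:02.231", "15:59:44.721",   # Friday
        "09:30:02.981", "15:59:42.654",   # Monday (weekend in between)
    ]),
    "value": [1, 19, 3, 82],
    "date": pd.to_datetime(
        ["2021-05-07", "2021-05-07", "2021-05-10", "2021-05-10"]
    ),
})

# Number the distinct trading days 0, 1, 2, ... in chronological order.
codes, _ = pd.factorize(df["date"], sort=True)
day_index = pd.Series(codes, index=df.index)

# Same idea as the answer: consecutive calendar dates starting at the first date.
df["compressed_date"] = df["date"].min() + pd.to_timedelta(codes, unit="D")

# Assumed session boundaries (09:30 to 16:00); adjust for the actual market.
session_open = pd.Timedelta(hours=9, minutes=30)
session_length = pd.Timedelta(hours=6, minutes=30)

# Trading time elapsed since the first session opened, with no overnight
# or weekend gaps: full sessions already completed plus time into today's.
df["elapsed"] = day_index * session_length + (df["arrivalTime"] - session_open)

print(df[["date", "compressed_date", "elapsed"]])
```

Plotting `value` against `elapsed` instead of the raw timestamp should then give a series without the weekend and after-hours gaps. Numbering the days with `pd.factorize` rather than comparing consecutive rows means the rows do not have to be sorted by date first.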
