
How to resample a Pandas DataFrame at a lower frequency and stop it creating NaN's?

I have a Pandas DataFrame with a DateTime index. It holds closing prices of some stocks sampled at 1-minute intervals. I want to resample this DataFrame to a 5-minute interval, as if the data had been collected that way. For example:

                         SPY     AAPL
DateTime        
2014-01-02 09:30:00     183.91  555.890
2014-01-02 09:31:00     183.89  556.060
2014-01-02 09:32:00     183.90  556.180
2014-01-02 09:33:00     184.00  556.550
2014-01-02 09:34:00     183.98  556.325
2014-01-02 09:35:00     183.89  554.620
2014-01-02 09:36:00     183.83  554.210

I need to get something like

                         SPY     AAPL
DateTime        
2014-01-02 09:30:00     183.91  555.890
2014-01-02 09:35:00     183.89  554.620

The natural approach would be resample() or asfreq() in Pandas. They do produce what I need, but with some undesired output as well. My sample has no observations from 4pm of a given weekday until 9:30am of the following day, because trading is halted during those hours. These methods end up padding the DataFrame with NaN during those periods, when there is actually no data to resample from. Is there any option I can use to avoid this behavior? From 4:05pm until 9:25am of the following day I get nothing but NaN!

My quick and dirty solution was the following:

Prices_5min = Prices[np.remainder(Prices.index.minute, 5) == 0]

Although I believe this is a quick and reasonably elegant solution, I would assume that resample() has some option to perform this task. Any ideas? Thanks a lot!


EDIT: Following the comment regarding the undesired output, I am adding the following code to showcase the problem:

New_Prices = Prices.asfreq('5min')
New_Prices.loc['2014-01-02 15:50:00':'2014-01-03 9:05:00']
Out:
                         SPY    AAPL
DateTime        
2014-01-02 15:50:00     183.12  552.83
2014-01-02 15:55:00     183.08  552.89
2014-01-02 16:00:00     182.92  553.18
2014-01-02 16:05:00     NaN     NaN
2014-01-02 16:10:00     NaN     NaN
...     ...     ...
2014-01-03 08:45:00     NaN     NaN
2014-01-03 08:50:00     NaN     NaN
2014-01-03 08:55:00     NaN     NaN
2014-01-03 09:00:00     NaN     NaN
2014-01-03 09:05:00     NaN     NaN

None of these NaN rows should be part of the final result. They are only there because those were not trading hours. I want to avoid that.

You could simply discard the rows containing NaN values with dropna().

Demo with a slightly modified version of your input data:

                        SPY     AAPL
DateTime                            
2014-01-02 09:30:00  183.91  555.890
2014-01-02 09:31:00  183.89  556.060
2014-01-02 09:32:00  183.90  556.180
2014-01-02 09:33:00  184.00  556.550
2014-01-02 09:34:00  183.98  556.325
2014-01-02 09:45:00  183.89  554.620
2014-01-02 09:46:00  183.83  554.210
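For reference, a minimal sketch to reconstruct the demo frame shown above (this construction is not part of the original answer, only the resulting table is):

import pandas as pd

# Rebuild the demo frame from the table above
df = pd.DataFrame(
    {
        'SPY':  [183.91, 183.89, 183.90, 184.00, 183.98, 183.89, 183.83],
        'AAPL': [555.890, 556.060, 556.180, 556.550, 556.325, 554.620, 554.210],
    },
    index=pd.DatetimeIndex([
        '2014-01-02 09:30:00', '2014-01-02 09:31:00', '2014-01-02 09:32:00',
        '2014-01-02 09:33:00', '2014-01-02 09:34:00', '2014-01-02 09:45:00',
        '2014-01-02 09:46:00',
    ], name='DateTime'),
)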

Straight resampling gives rows with NaN values:

df.asfreq('5min')

                        SPY    AAPL
DateTime                           
2014-01-02 09:30:00  183.91  555.89
2014-01-02 09:35:00     NaN     NaN
2014-01-02 09:40:00     NaN     NaN
2014-01-02 09:45:00  183.89  554.62

which go away with dropna():

df.asfreq('5min').dropna()

                        SPY    AAPL
DateTime                           
2014-01-02 09:30:00  183.91  555.89
2014-01-02 09:45:00  183.89  554.62
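
Applied to the 1-minute Prices frame from the question, the same idea might look like this minimal sketch; dropna(how='all') only drops the rows where every column is NaN, which is exactly what the non-trading periods produce:

# Take every 5th minute, then drop the rows that are entirely NaN (the halted hours)
Prices_5min = Prices.asfreq('5min').dropna(how='all')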

Overview: Create an IntervalIndex describing the trading hours (09:30 to 16:00 on business days). Then find the time stamps (from resample) that fall inside a trading window.

import pandas as pd

bdate_range = pd.bdate_range(start='2014-01-02', periods=5)
bdate_range

trading_windows = [
    (d + pd.Timedelta('9.5h'), d + pd.Timedelta('16h'))
    for d in bdate_range
]
trading_windows

trading_windows = pd.IntervalIndex.from_tuples(trading_windows)

for t in trading_windows: print(t)

(2014-01-02 09:30:00, 2014-01-02 16:00:00]
(2014-01-03 09:30:00, 2014-01-03 16:00:00]
(2014-01-06 09:30:00, 2014-01-06 16:00:00]
(2014-01-07 09:30:00, 2014-01-07 16:00:00]
(2014-01-08 09:30:00, 2014-01-08 16:00:00]

...and created a list of 5-minute time stamps from your example (some during trading hours, others while trading is halted):

stamps = [
    '2014-01-02 15:50:00',
    '2014-01-02 15:55:00',
    '2014-01-02 16:00:00',
    '2014-01-02 16:05:00',
    '2014-01-02 16:10:00',
]
stamps = pd.to_datetime(stamps)

Then, I used the .contains() method of IntervalIndex to determine whether a timestamp (from resample) falls within a trading window:

mask = [trading_windows.contains(stamp).any() for stamp in stamps]
stamps[mask]


Out:
DatetimeIndex(['2014-01-02 15:50:00', '2014-01-02 15:55:00',
               '2014-01-02 16:00:00'],
              dtype='datetime64[ns]', freq=None)

This keeps every time stamp inside the trading window (whether there are actual trades or not). You can also account for holidays when building trading_windows.
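
Putting the pieces together on the question's Prices frame could look like the sketch below; the choice of resample aggregation (first) and the way the business-day range is derived are assumptions, not part of the original answer:

import pandas as pd

# Resample to 5 minutes, then keep only the stamps that fall inside a trading window
resampled = Prices.resample('5min').first()   # 'first' is an assumed aggregation choice

# Trading windows as above: 09:30-16:00 on the business days covering the sample
bdays = pd.bdate_range(start=Prices.index.min().normalize(),
                       end=Prices.index.max().normalize())
trading_windows = pd.IntervalIndex.from_tuples(
    [(d + pd.Timedelta('9.5h'), d + pd.Timedelta('16h')) for d in bdays]
)

# Keep only the resampled stamps that fall inside some trading window
mask = [trading_windows.contains(stamp).any() for stamp in resampled.index]
Prices_5min = resampled[mask]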

Resampling at a 5-minute frequency with the 'last' aggregation should work in your case. You can set the labels to the right edge and include the right end of each bin in the resampling.

Finally, you can apply ffill to avoid time leakage.
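
A rough sketch of that suggestion, with the parameter choices being one possible reading of the advice rather than tested code:

# Label each 5-minute bar by its right edge, include the right edge in the bin,
# take the last observed price in each bar, then forward-fill the remaining gaps
Prices_5min = (
    Prices.resample('5min', label='right', closed='right')
          .last()
          .ffill()
)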
