
Select CONSECUTIVE rows from a DataFrame based on values in a column in Pandas with Groupby

I have 5 years of S&P 500 data that I'm trying to group into specific time chunks to run some analysis on. My data is in 5-minute increments. After reading it into a DataFrame called dated, I thought I could group it into chunks of consecutive rows spanning the time between the Globex open and close of each trading day. The Globex open is at 3:00 PM (15:00) PST and the close is at 1:15 PM (13:15) PST, 22.25 hours later. So I would like to chunk the data from the 3:00 PM open THROUGH to the 1:15 PM close--that's approximately 275 rows per chunk.

The difficulty is that a 'trading day' spans 2 different dates. For example, a session opens on Sunday 3-6-2016 at 15:00 and closes on Monday, 3-7-2016 at 13:15. I'm not able to group by the 'Date' column because, of course, all of 3-6 would be in one chunk and 3-7 in another, when I need the chunk to span both dates in order to get the whole Globex day in one piece.

Being relatively new to both Pandas and Python I don't know what method to use to group the data into the chunks that I want. Once the data is segregated, I'd like to extract the high and low of each session/chunk into separate columns and also a column for just the 15:05 open price.

Here is a sample of what my data looks like:

             Date   Time     Open     High      Low    Close   Up  Down
    0  2015-08-03  15:05  2073.50  2074.00  2073.25  2073.25  210   563
    1  2015-08-03  15:10  2073.25  2073.25  2072.25  2072.75  118   632
    2  2015-08-03  15:15  2072.75  2072.75  2072.25  2072.50  132    85
    3  2015-08-03  15:20  2072.50  2072.75  2072.25  2072.50   95   312
    4  2015-08-03  15:25  2072.50  2074.00  2072.50  2073.50  372   264

Originally, the 'Date' column values looked like this: 8/3/2015. I thought they might not be read as actual date objects, so I converted them with to_datetime() so that the 'Date' column reads as a proper date object, as seen in my sample DataFrame above.

dated['Date'] = pd.to_datetime(dated['Date'])

When I tried to change the values in the 'Time' column using to_datetime(), it successfully changes the time from 15:05 to 15:05:00, but it also added the day, so it looked like this: '2016-03-05 15:05:00', the problem being that it used today's date for the date. This obviously won't work for me as my data is historical and the dates and times are references to historical prices.
The reason I was trying to change the 'Time' column to a datetime object is I thought I would be able to slice it into the chunks that I need during a groupby operation:

dated = dated['Date'].groupby(dated['15:05' : '13:20'])  

This produced an error:

IndexError: invalid slice

So I'd appreciate any help solving this problem, or pointers to the right areas of research. I've been reading the pandas documentation piece by piece and trying out different methods, but as I'm not sure what steps to start with, I've been picking topics at random and not finding the answer.

Thanks, Ana

This is actually very complicated.

First, you can convert your time as follows:

df['Datetime'] = pd.to_datetime(df.Date + ' ' + df.Time)
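Applied to the first two sample rows above, the combined column carries the historical date rather than today's (a minimal sketch using the sample data's column names):

```python
import pandas as pd

# Two rows mirroring the sample data.
df = pd.DataFrame({'Date': ['2015-08-03', '2015-08-03'],
                   'Time': ['15:05', '15:10']})

# Concatenate the strings, then parse once.
df['Datetime'] = pd.to_datetime(df.Date + ' ' + df.Time)
print(df.Datetime.iloc[0])  # 2015-08-03 15:05:00
```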

Here, I will create a larger sample dataframe:

import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2015-1-1', '2016-1-1', freq='5min')
df = pd.DataFrame(np.random.randn(len(idx), 6),
                  columns=['Open', 'High', 'Low', 'Close', 'Up', 'Down'])
df['Datetime'] = idx

Let's add a boolean flag to indicate when the market is open.

import datetime as dt

# Create a market-open flag: True between 15:00 and 13:15 the next day.
df['market_open'] = False
mask = (df.Datetime.dt.time > dt.time(15)) | (df.Datetime.dt.time < dt.time(13, 15))
df.loc[mask, 'market_open'] = True

Here we create a function that returns the open, high, low, close, etc. of the grouped bars:

def ohlc(df):
    return (
        df.Datetime.iat[-1], # last timestamp in group.
        df.Open.iat[0], # First Open.
        df.High.max(), 
        df.Low.min(), 
        df.Close.iat[-1], # Last Close.
        df.Up.sum(), 
        df.Down.sum(),
        df.Close.count(), # Count number of closing bars.
        df.market_open.iat[0])   # Take first True/False indicator.

Now we group on the change in market_open (i.e. on transitions of the True/False flag) and then apply our function to the grouped results.

bars = pd.DataFrame(
    zip(*df.groupby(
        (df.market_open != df.market_open.shift()).cumsum()
    ).apply(ohlc))
).T

bars.columns = ['bar_close_time', 'Open', 'High', 'Low', 'Close', 'Up', 'Down', 'bar_count', 'market_open']

We have bars for both open and closed sessions. We can remove the ones for when the market is closed.

# Remove bars when market is closed
bars = bars[bars.market_open].iloc[:, :-1]

>>> bars.tail()
          bar_close_time      Open     High      Low      Close        Up     Down bar_count
722  2015-12-28 13:10:00   1.23175  2.88569  -2.7143  -0.785648  -13.3166  14.6094       266
724  2015-12-29 13:10:00 -0.900675   2.6483 -2.61698    -0.8265  0.825872  4.98565       266
726  2015-12-30 13:10:00   1.65299  2.57881 -2.85199  -0.376141  -4.32867  3.62123       266
728  2015-12-31 13:10:00  0.435619  2.93638 -2.74758  -0.461525  -20.0928 -15.8205       266
730  2016-01-01 00:00:00  0.293165  2.39097  -2.1234  0.0684124  -7.83721  1.69182       108
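As an aside (not part of the original answer), the zip/transpose step can be avoided: apply already returns one tuple per group, so the frame can be built directly with .tolist(). A self-contained sketch on a two-day sample:

```python
import datetime as dt

import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2015-1-1', '2015-1-3', freq='5min')
df = pd.DataFrame(np.random.randn(len(idx), 6),
                  columns=['Open', 'High', 'Low', 'Close', 'Up', 'Down'])
df['Datetime'] = idx
df['market_open'] = ((df.Datetime.dt.time > dt.time(15)) |
                     (df.Datetime.dt.time < dt.time(13, 15)))

def ohlc(g):
    # Same aggregation as above: last timestamp, first open, extremes, sums.
    return (g.Datetime.iat[-1], g.Open.iat[0], g.High.max(), g.Low.min(),
            g.Close.iat[-1], g.Up.sum(), g.Down.sum(), g.Close.count(),
            g.market_open.iat[0])

# One session id per run of identical flag values.
session = (df.market_open != df.market_open.shift()).cumsum()
bars = pd.DataFrame(df.groupby(session).apply(ohlc).tolist(),
                    columns=['bar_close_time', 'Open', 'High', 'Low', 'Close',
                             'Up', 'Down', 'bar_count', 'market_open'])
bars = bars[bars.market_open].iloc[:, :-1]
```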

You could consider an auxiliary datetime column equal to the original minus 14 hours. Each trading session would then fall on just one day.

This is the same as viewing the datetimes in another time zone, one in which the trading session does not span two days.
