
DataFrame group and shift to get previous column values

I have a DF like so:


    asset_id    source_id   open_px close_px    start_bin           end_bin
0   1           a           None    10          2022-01-01 09:30:00 2022-01-01 10:00:00
1   1           a           None    10          2022-01-01 10:00:00 2022-01-01 10:30:00
2   2           a           None    101         2022-01-01 09:30:00 2022-01-01 10:00:00
3   2           a           None    500         2022-01-01 10:00:00 2022-01-01 10:30:00
4   2           a           None    600         2022-01-01 10:30:00 2022-01-01 11:00:00

code to generate:

import datetime

import pandas as pd

# One row per (asset, 30-minute bin); open_px is not populated yet.
rows = [
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 101, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [2, 'a', None, 500, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 600, datetime.datetime(2022, 1, 1, 10, 30), datetime.datetime(2022, 1, 1, 11, 0)]
]

cols = ['asset_id', 'source_id', 'open_px', 'close_px', 'start_bin', 'end_bin']

df = pd.DataFrame(data=rows, columns=cols)

I want to fill open_px with the previous bin's close_px, i.e. the close of the row whose end_bin equals this row's start_bin, within each asset_id group, in the most pandas-friendly way. (I'm happy for the first entry in each group to remain None.) I don't want to brute-force this with a loop, as the dataset is quite large.

Expected Output:


    asset_id    source_id   open_px close_px    start_bin           end_bin
0   1           a           None    10          2022-01-01 09:30:00 2022-01-01 10:00:00
1   1           a           10      10          2022-01-01 10:00:00 2022-01-01 10:30:00
2   2           a           None    101         2022-01-01 09:30:00 2022-01-01 10:00:00
3   2           a           101     500         2022-01-01 10:00:00 2022-01-01 10:30:00
4   2           a           500     600         2022-01-01 10:30:00 2022-01-01 11:00:00
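
One way is to sort by asset and bin, shift close_px down one row, then blank out the first row of each asset:

# Order rows by asset, then chronologically within each asset.
df.sort_values(['asset_id','start_bin'], inplace=True)
# Take the previous row's close as this row's open.
df['open_px'] = df['close_px'].shift()
# The first row of each asset has no previous bin, so clear the value carried over from the prior asset.
df.loc[~df['asset_id'].duplicated(),'open_px'] = None
print(df)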

   asset_id source_id  open_px  close_px           start_bin             end_bin
0         1         a      NaN        10 2022-01-01 09:30:00 2022-01-01 10:00:00  
1         1         a     10.0        10 2022-01-01 10:00:00 2022-01-01 10:30:00  
2         2         a      NaN       101 2022-01-01 09:30:00 2022-01-01 10:00:00  
3         2         a    101.0       500 2022-01-01 10:00:00 2022-01-01 10:30:00  
4         2         a    500.0       600 2022-01-01 10:30:00 2022-01-01 11:00:00  
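
A possibly more direct variant is to let groupby do the per-asset bookkeeping, so no separate step is needed to clear the first row of each group. This is only a sketch on the sample data from the question; like the snippet above, it assumes the bins within each asset_id are contiguous, so "previous row" and "previous bin" mean the same thing.

```python
import datetime

import pandas as pd

rows = [
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 101, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [2, 'a', None, 500, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 600, datetime.datetime(2022, 1, 1, 10, 30), datetime.datetime(2022, 1, 1, 11, 0)],
]
cols = ['asset_id', 'source_id', 'open_px', 'close_px', 'start_bin', 'end_bin']
df = pd.DataFrame(data=rows, columns=cols)

# Sort chronologically within each asset so "previous row" means "previous bin".
df = df.sort_values(['asset_id', 'start_bin'])

# Shift close_px by one row inside each asset_id group; the first row of
# every group gets NaN automatically, so nothing leaks across assets.
df['open_px'] = df.groupby('asset_id')['close_px'].shift()

print(df)
```

Because of the NaN entries, open_px comes back as a float column, the same as in the output above.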

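If some bins can be missing, shifting by one row would pick up a close from a bin that does not actually end at this row's start_bin. A hedged alternative is a self-merge that matches each row's start_bin against the end_bin of another row for the same asset_id. The names prev and prev_close below are illustrative only, and the sketch assumes each (asset_id, end_bin) pair occurs at most once; validate='many_to_one' makes the merge raise if that assumption does not hold.

```python
import datetime

import pandas as pd

rows = [
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 101, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [2, 'a', None, 500, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 600, datetime.datetime(2022, 1, 1, 10, 30), datetime.datetime(2022, 1, 1, 11, 0)],
]
cols = ['asset_id', 'source_id', 'open_px', 'close_px', 'start_bin', 'end_bin']
df = pd.DataFrame(data=rows, columns=cols)

# Lookup table: for each asset, the close of the bin ending at a given time.
# Renaming end_bin to start_bin lets it join onto the bin that starts there.
prev = (
    df[['asset_id', 'end_bin', 'close_px']]
    .rename(columns={'end_bin': 'start_bin', 'close_px': 'prev_close'})
)

# Left merge keeps every original row; rows with no preceding bin get NaN.
out = (
    df.drop(columns='open_px')
    .merge(prev, on=['asset_id', 'start_bin'], how='left', validate='many_to_one')
    .rename(columns={'prev_close': 'open_px'})
)

# Restore the original column order.
out = out[cols]

print(out)
```

On the sample data this gives the same open_px values as the shift-based approach; the two only differ when a bin is missing for an asset.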