pandas - Access value in a previous row in a DataFrame

I am trying to add a new column to my dataframe that depends on values that may or may not exist in previous rows. My dataframe looks like this:

index  id  timestamp  sequence_index value  prev_seq_index
0      10  1          0              5      0
1      10  1          1              1      2
2      10  1          2              2      0
3      10  2          0              9      0
4      10  2          1              10     1
5      10  2          2              3      1
6      11  2          0              42     1
7      11  2          1              13     0

Note: there is no relation between index and sequence_index; index is just a counter.

What I want to do is add a column prev_value that holds the value of the most recent earlier row with the same id and with sequence_index == prev_seq_index. If no such previous row exists, a default value should be used; for the purpose of this question I will use a default of -1.

index  id  timestamp  sequence_index value  prev_seq_index  prev_value
0      10  1          0              5      0               -1
1      10  1          1              1      2               -1
2      10  1          2              2      0               -1
3      10  2          0              9      0               5  # value from df[index == 0]
4      10  2          1              10     1               1  # value from df[index == 1]
5      10  2          2              3      1               1  # value from df[index == 1]
6      11  2          0              42     1               -1
7      11  2          1              13     0               -1
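
To make the example reproducible, the two tables above could be built along these lines. This is only a sketch: it assumes that index is the default integer index rather than a real column, and the names df and expected are chosen here for illustration.

```python
import pandas as pd

# Input frame from the question; the "index" column is the default RangeIndex
df = pd.DataFrame({
    "id":             [10, 10, 10, 10, 10, 10, 11, 11],
    "timestamp":      [1, 1, 1, 2, 2, 2, 2, 2],
    "sequence_index": [0, 1, 2, 0, 1, 2, 0, 1],
    "value":          [5, 1, 2, 9, 10, 3, 42, 13],
    "prev_seq_index": [0, 2, 0, 0, 1, 1, 1, 0],
})

# Desired result: same frame plus the prev_value column (default -1)
expected = df.assign(prev_value=[-1, -1, -1, 5, 1, 1, -1, -1])
```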

My current solution is a brute-force loop that is very slow, and I was wondering whether there is a faster way:

import numpy as np

prev_values = np.zeros(len(df))
for i, (index, row) in enumerate(df.iterrows()):
    # filter for earlier rows with the same id and the desired sequence index
    tmp_df = df[(df.id == row.id) & (df.timestamp < row.timestamp)
                & (df.sequence_index == row.prev_seq_index)]
    if len(tmp_df) > 0:
        # get the scalar value from the most recent matching row
        prev_value = tmp_df.loc[tmp_df.timestamp.idxmax(), 'value']
    else:
        prev_value = -1
    prev_values[i] = prev_value

df['prev_value'] = prev_values

I would suggest tackling this with a left join. First, however, you'll need to make sure that your data doesn't contain duplicates on the join keys. Then create a dataframe of the most recent timestamps and grab the corresponding values:

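import pandas as pd

# most recent timestamp for each sequence_index
agg = df.groupby('sequence_index', as_index=False).agg({'timestamp': 'max'})

# attach the value recorded at that timestamp
agg = pd.merge(agg, df[['timestamp', 'sequence_index', 'value']], how='inner', on=['timestamp', 'sequence_index'])

agg.rename(columns={'value': 'prev_value'}, inplace=True)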

Now you can join the data back onto itself:

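# suffixes keeps the original column names on the left side of the join
df = pd.merge(df, agg, how='left', left_on='prev_seq_index', right_on='sequence_index', suffixes=('', '_prev'))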

Now you can deal with the NaN values:

df['prev_value'] = df['prev_value'].fillna(-1)
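
Note that the join above matches on sequence_index only, so it takes into account neither id nor the requirement that the matching row be strictly earlier. One possible way to cover both is pd.merge_asof, which is a different technique from the plain left join. The following is a minimal sketch under the assumption that timestamp is numeric (or datetime) and that "most recent" means the largest strictly smaller timestamp. The function name add_prev_value and the helper column _pos are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":             [10, 10, 10, 10, 10, 10, 11, 11],
    "timestamp":      [1, 1, 1, 2, 2, 2, 2, 2],
    "sequence_index": [0, 1, 2, 0, 1, 2, 0, 1],
    "value":          [5, 1, 2, 9, 10, 3, 42, 13],
    "prev_seq_index": [0, 2, 0, 0, 1, 1, 1, 0],
})

def add_prev_value(df, default=-1):
    """Return a copy of df with a prev_value column (hypothetical helper)."""
    # Candidate "previous" rows, keyed by their own sequence_index so that it
    # can be matched against prev_seq_index of the current row.
    lookup = (
        df[["id", "timestamp", "sequence_index", "value"]]
        .rename(columns={"sequence_index": "prev_seq_index", "value": "prev_value"})
        .sort_values("timestamp", kind="stable")
    )
    # merge_asof needs both sides sorted by the "on" key; _pos remembers the
    # original row order so that it can be restored afterwards.
    left = df.assign(_pos=np.arange(len(df))).sort_values("timestamp", kind="stable")
    merged = pd.merge_asof(
        left,
        lookup,
        on="timestamp",
        by=["id", "prev_seq_index"],
        direction="backward",         # look at earlier rows only
        allow_exact_matches=False,    # strictly earlier timestamp
    ).sort_values("_pos")

    out = df.copy()
    out["prev_value"] = (
        merged["prev_value"].fillna(default).astype(df["value"].dtype).to_numpy()
    )
    return out

result = add_prev_value(df)
print(result)
```

On the sample data this should reproduce the prev_value column requested in the question. If several earlier rows share the same id, sequence_index and timestamp, merge_asof picks the last of them in sorted order, so the duplicate check mentioned above still matters.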

