
Pandas.DataFrame: efficient way to add a column "seconds since last event"

I have a Pandas.DataFrame with a standard index representing seconds, and I want to add a column "seconds elapsed since last event" where the events are given in a list. Specifically, say

event = [2, 5]

and

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((7, 1)))
|    |   0 |
|---:|----:|
|  0 |   0 |
|  1 |   0 |
|  2 |   0 |
|  3 |   0 |
|  4 |   0 |
|  5 |   0 |
|  6 |   0 |

Then I want to obtain

|    |   0 |    x |
|---:|----:|-----:|
|  0 |   0 | <NA> |
|  1 |   0 | <NA> |
|  2 |   0 |    0 |
|  3 |   0 |    1 |
|  4 |   0 |    2 |
|  5 |   0 |    0 |
|  6 |   0 |    1 |

I tried

df["x"] = pd.Series(range(5)).shift(2)

|    |   0 |   x |
|---:|----:|----:|
|  0 |   0 | nan |
|  1 |   0 | nan |
|  2 |   0 |   0 |
|  3 |   0 |   1 |
|  4 |   0 |   2 |
|  5 |   0 | nan |
|  6 |   0 | nan |

so apparently, to make it work, I need to write df["x"] = pd.Series(range(5+2)).shift(2): the shifted series has to be as long as the DataFrame, otherwise the index positions it does not cover get filled with NaN on assignment.

More importantly, when I then do df["x"] = pd.Series(range(2+5)).shift(5), I obtain

|    |   0 |   x |
|---:|----:|----:|
|  0 |   0 | nan |
|  1 |   0 | nan |
|  2 |   0 | nan |
|  3 |   0 | nan |
|  4 |   0 | nan |
|  5 |   0 |   0 |
|  6 |   0 |   1 |

That is, the previous values have been overwritten. Is there a way to assign new values without overwriting the existing ones with NaN? Then I could do something like

for i in event:
    df["x"] = pd.Series(range(len(df))).shift(i)

Or is there a more efficient way?
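As an aside, one way to do such an assignment without overwriting is Series.combine_first, which keeps the calling series' values and falls back to the other series only where the caller is NaN. A minimal sketch of the loop idea above (still a Python loop, so not necessarily the efficient route), with the freshly shifted counter put first so the count resets at each event:

x = pd.Series(np.nan, index=df.index)
for i in event:
    # Later events take precedence; NaN positions fall back to the previous value.
    x = pd.Series(range(len(df)), index=df.index).shift(i).combine_first(x)
df["x"] = x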

For the record, here is my naive code (wrapped in a function here so that the return statement is valid). It works, but seems inefficient and poorly designed:

def seconds_since_event(df, event):
    # Sentinel for rows before the first event (the desired output has <NA> there).
    c = 1000000
    df["x"] = c
    if event:
        idx = 0
        for row in df.itertuples():
            # Reset the counter whenever the index reaches the next event.
            if idx < len(event) and row.Index == event[idx]:
                c = 0
                idx += 1
            df.loc[row.Index, "x"] = c
            c += 1
    return df
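Calling it on the example data (using the function wrapper above) reproduces the desired column, with the 1000000 sentinel standing in for the missing values before the first event:

df = pd.DataFrame(np.zeros((7, 1)))
print(seconds_since_event(df, [2, 5]))
# column "x": 1000000, 1000001, 0, 1, 2, 0, 1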

Let's try this:

df = pd.DataFrame(np.zeros((7, 1)))
event = [2, 5]

# Mark the event rows and turn every other value into NaN...
df.loc[event, 0] = 1
df = df.replace(0, np.nan)

# ...then forward-fill a cumulative event counter to label each block
# and count within it; rows before the first event stay NaN.
grp = df[0].cumsum().ffill()
df['x'] = df.groupby(grp).cumcount().mask(grp.isna())
df

Output:

|    |   0 |   x |
|---:|----:|----:|
|  0 | nan | nan |
|  1 | nan | nan |
|  2 |   1 |   0 |
|  3 | nan |   1 |
|  4 | nan |   2 |
|  5 |   1 |   0 |
|  6 | nan |   1 |
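Note that this overwrites column 0 with the event markers. The same idea works without touching the original column, by building the marker series separately (a sketch along the same lines, starting again from the original df; not the answerer's exact code):

marker = pd.Series(np.where(df.index.isin(event), 1.0, np.nan), index=df.index)
grp = marker.cumsum().ffill()
df["x"] = df.groupby(grp).cumcount().mask(grp.isna())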

IIUC, you can build the group labels with a cumulative sum and then use a single groupby:

s = df.index.isin(event).cumsum()
# or equivalently
# s = df.loc[event, 0].reindex(df.index).notna().cumsum()

df['x'] = np.where(s > 0, df.groupby(s).cumcount(), np.nan)

Output:

     0    x
0  0.0  NaN
1  0.0  NaN
2  0.0  0.0
3  0.0  1.0
4  0.0  2.0
5  0.0  0.0
6  0.0  1.0
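If the <NA> markers and integer dtype from the desired output matter, the counts can be kept as pandas' nullable Int64 instead of going through np.where and float NaN (a small variation on the snippet above, not part of the original answer):

s = pd.Series(df.index.isin(event), index=df.index).cumsum()
# where() on an Int64 series masks with <NA> rather than float NaN
df["x"] = df.groupby(s).cumcount().astype("Int64").where(s > 0)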
