
Secondary row value of highest rolling sums pandas

I am trying to find the value in one column that has the highest cumulative sum of a different column. My dataframe looks like this:

import pandas as pd
df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'], 'value': [1, 3, 1, 5, 1, 9, 2]})

indx  constant  value
0        a        1
1        b        3
2        b        1
3        c        5
4        c        1
5        d        9
6        a        2

I am trying to add a new field containing the constant that has the highest cumulative sum of value up to that point in the dataframe. The final dataframe would look like this:

indx constant   value   new_field
0      a          1         NaN
1      b          3          a
2      b          1          b
3      c          5          b
4      c          1          c
5      d          9          c
6      a          2          d

As you can see, at index 1, a has the highest cumulative sum of value for all prior rows. At index 2, b has the highest cumulative sum of value for all prior rows, and so on.
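
To make the rule concrete, here is a minimal plain-Python sketch of it (not part of the original post). It keeps a running total per constant and, for each row, records which constant leads among the prior rows only. The names `totals` and `new_field` are just illustrative, and `None` stands in for NaN.

```python
from collections import defaultdict

constants = ['a', 'b', 'b', 'c', 'c', 'd', 'a']
values = [1, 3, 1, 5, 1, 9, 2]

totals = defaultdict(int)   # running sum of value per constant
new_field = []
for c, v in zip(constants, values):
    # Look only at rows seen so far; the first row has no prior rows.
    leader = max(totals, key=totals.get) if totals else None
    new_field.append(leader)
    totals[c] += v          # the current row counts from the next row on

print(new_field)
```

This loop is only meant to pin down the expected result; a vectorised pandas version is what the question is asking for.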

Anyone have a solution?

As presented, you just need a shift. For other scenarios, however, try the following.

Steps:

Find the cumulative maximum of df['value'].

Where the cumulative maximum equals df['value'], copy 'constant'; otherwise set NaN.

Forward-fill the NaNs so that each row carries the constant of the most recent maximum, then shift down one row so that each row only reflects prior rows.

Outcome

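import numpy as np
# Keep 'constant' where 'value' sets a new running maximum, forward-fill the gaps, then shift down one row.
df = df.assign(new_field=np.where(df['value'] == df['value'].cummax(), df['constant'], np.nan)).ffill()
df = df.assign(new_field=df['new_field'].shift())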



   constant  value new_field
0        a      1       NaN
1        b      3         a
2        b      1         b
3        c      5         b
4        c      1         c
5        d      9         c
6        a      2         d

You should be a little more careful, since values can be negative, which decreases the cumulative sum. Here is what you probably need to do:

df["cumsum"] = df["value"].cumsum()
df["cummax"] = df["cumsum"].cummax()
df["new"] = np.where(df["cumsum"] == df["cummax"], df['constant'], np.nan)
df["new"] = df.ffill()["new"].shift()
df
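
As a rough illustration of the negative-value caveat, the sketch below runs the same steps on a small made-up frame (the data is an assumption, not from the question) in which one row has a negative value. The running total dips on that row, so it does not count as a new peak and its label comes from the forward fill instead.

```python
import numpy as np
import pandas as pd

# Made-up data (not from the question): row 2 carries a negative value.
demo = pd.DataFrame({"constant": ["a", "b", "b", "c"], "value": [1, 3, -2, 5]})

demo["cumsum"] = demo["value"].cumsum()    # 1, 4, 2, 7
demo["cummax"] = demo["cumsum"].cummax()   # 1, 4, 4, 7
# Keep the label only on rows where the running total reaches a new peak.
demo["new"] = np.where(demo["cumsum"] == demo["cummax"], demo["constant"], np.nan)
# Carry the last peak's label forward, then shift so each row sees prior rows only.
demo["new"] = demo["new"].ffill().shift()
print(demo)
```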

I think you should try and approach this as a pivot table, which would allow you to use np.argmax over the column axis.

# this computes the cumulative sum of `value` down the index for each value of `constant`
X = df.pivot_table(
    index=df.index,
    columns=['constant'],
    values='value'
).fillna(0.0).cumsum(axis=0)

# now you get a list of indices that maximise the cumulative value over the column axis - i.e., the "winner"
colix = np.argmax(X.values, axis=1)

# you can fetch corresponding column names using this argmax index
df['winner'] = np.r_[[np.nan], X.columns[colix].values[:-1]]

# and there you go
df

   constant  value winner
0        a      1    NaN
1        b      3      a
2        b      1      b
3        c      5      b
4        c      1      c
5        d      9      c
6        a      2      d
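
For comparison, the same pivot idea can probably be written with pandas alone. The following is a minimal sketch, not the answerer's code: it assumes `DataFrame.pivot` is called without an `index` argument so the existing row index is kept, and it uses `idxmax(axis=1)` in place of `np.argmax` plus a column lookup. The name `totals` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'],
                   'value': [1, 3, 1, 5, 1, 9, 2]})

# One column per constant; missing cells become 0 before the running sum.
totals = df.pivot(columns='constant', values='value').fillna(0).cumsum()

# idxmax(axis=1) returns the leading column label per row;
# shift() moves it down so each row only reflects prior rows.
df['winner'] = totals.idxmax(axis=1).shift()
print(df)
```

On ties, `idxmax` returns the first column that reaches the maximum, so which constant wins a tie depends on the column order.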
