Pandas column that depends on its previous value (row)?

I would like to create a 3rd column in my dataframe, which depends on both the new and existing columns in the previous row.

This new column should start at 0.

Its next value is its previous value plus df.below_lo[i] (if the previous value was 0).

If its previous value was 1, its next value is its previous value plus df.above_hi[i] .

I think I have two issues: how to initialize this 3rd column and how to make it depend on its own previous value.

import pandas as pd
import math

data = {'below_lo': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        'above_hi': [0, 0, -1, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0]}

df = pd.DataFrame(data)

df['pos'] = math.nan
df['pos'][0] = 0

for i in range(len(df.below_lo)):
    if df.pos[i] == 0:
        df.pos[i+1] = df.pos[i] + df.below_lo[i]
    if df.pos[i] == 1:
        df.pos[i+1] = df.pos[i] + df.above_hi[i]

print(df)

The desired output is:

    below_lo  above_hi  pos
0        0.0       0.0  0.0
1        1.0       0.0  0.0
2        0.0      -1.0  1.0
3        0.0       0.0  0.0
4        0.0      -1.0  0.0
5        0.0       0.0  0.0
6        0.0      -1.0  0.0
7        0.0       0.0  0.0
8        0.0       0.0  0.0
9        1.0       0.0  0.0
10       0.0       0.0  1.0
11       0.0       0.0  1.0
12       0.0       0.0  1.0
13       NaN       NaN  1.0

The above code produces the correct output, except I am also getting several of these warnings:

A value is trying to be set on a copy of a slice from a DataFrame

How do I clean this code up so that it runs without raising this warning?

Use .loc to set the values, instead of chained indexing such as df['pos'][0] = 0 or df.pos[i+1] = ...:

df.loc[0, 'pos'] = 0

for i in range(len(df.below_lo)):
    if df.loc[i, 'pos'] == 0:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'below_lo']
    if df.loc[i, 'pos'] == 1:
        df.loc[i+1, 'pos'] = df.loc[i, 'pos'] + df.loc[i, 'above_hi']
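
As a variation on the loop above, here is a minimal sketch that builds the column in a plain Python list and assigns it to the DataFrame once, so no element-wise assignment on the DataFrame is needed at all. The `else` branch (carry the previous value forward when it is neither 0 nor 1) is an assumption on my part; that case never occurs with the sample data. The list is trimmed to the length of the DataFrame, so the extra row 13 from the question's output is not created.

```python
import pandas as pd

data = {'below_lo': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        'above_hi': [0, 0, -1, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)

pos = [0]  # the new column starts at 0
for lo, hi in zip(df['below_lo'], df['above_hi']):
    prev = pos[-1]
    if prev == 0:
        pos.append(prev + lo)   # previous value 0: add below_lo of that row
    elif prev == 1:
        pos.append(prev + hi)   # previous value 1: add above_hi of that row
    else:
        pos.append(prev)        # assumption: keep the value in any other state

# The loop yields one value more than there are rows; drop the last one
# so that the column matches the existing index.
df['pos'] = pos[:len(df)]
print(df)
```

Rows 0 to 12 of pos should match the desired output in the question.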

I appreciate there is already an accepted, and perfectly good, answer by @Michael O., but if you dislike iterating over rows as not quite Pandas-esque, here is a solution without an explicit loop over rows:

from functools import reduce
res = reduce(lambda d, _ : 
    d.fillna({'pos':d['pos'].shift(1) 
            + (d['pos'].shift(1)  == 0) * d['below_lo'] 
            + (d['pos'].shift(1)  == 1) * d['above_hi']}), 
        range(len(df)), df)
res

produces

      below_lo    above_hi    pos
--  ----------  ----------  -----
 0           0           0      0
 1           1           0      1
 2           0          -1      0
 3           0           0      0
 4           0          -1      0
 5           0           0      0
 6           0          -1      0
 7           0           0      0
 8           0           0      0
 9           1           0      1
10           0           0      1
11           0           0      1
12           0           0      1

It is, admittedly, somewhat less efficient and has a bit more obscure syntax. But it could be written on a single line (even if I split it over a few for clarity)!

The idea is that we can use the fillna(...) function, passing it a value calculated from the previous value of 'pos' (hence shift(1)) and the current values of 'below_lo' and 'above_hi'. The extra complication is that a single call only fills the NaN in the row directly below a row that already holds a non-NaN value. Hence we need to apply this function repeatedly until all NaNs are filled, and this is where reduce comes into play.
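
To make the "one row per pass" behaviour concrete, here is a small sketch on a shortened, 4-row version of the data. The helper name `step` is hypothetical, and for simplicity it calls fillna on the 'pos' Series and reassigns it, rather than passing a dict to DataFrame.fillna as in the one-liner above; the arithmetic is the same. Like the one-liner, it reads 'below_lo' and 'above_hi' from the current row, which is why its result is shifted by one row compared with the desired output in the question.

```python
import math
import pandas as pd

df = pd.DataFrame({'below_lo': [0, 1, 0, 0],
                   'above_hi': [0, 0, -1, 0]})
df['pos'] = math.nan
df.loc[0, 'pos'] = 0  # only the first row is known at the start

def step(d):
    """Apply one fill pass and return a new DataFrame (hypothetical helper)."""
    prev = d['pos'].shift(1)  # previous row's pos, NaN where not yet known
    candidate = (prev
                 + (prev == 0) * d['below_lo']
                 + (prev == 1) * d['above_hi'])
    # fillna only touches rows where pos is still NaN; candidate is itself
    # NaN wherever prev is NaN, so each pass extends the filled part by one row.
    return d.assign(pos=d['pos'].fillna(candidate))

once = step(df)
twice = step(once)
print(once['pos'].isna().sum(), twice['pos'].isna().sum())
```

Each call leaves one NaN fewer than the call before, so len(df) passes are always enough, which is what range(len(df)) in the reduce call provides.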
