
pandas.DataFrame referencing previous row's value using numpy.select() and .shift()

I am trying to move from Excel to Python, and pandas in particular, but I am fairly new to it, so please forgive any 'rookie' errors.
I have a time series DataFrame, df, with columns 'mad' and 'run'. I have added another column, 'required result', to show what the value of 'mom_1' should be.
I thought the problem might be .shift(), but it does appear to work where df['run'] == 1.

import datetime
import numpy as np
import pandas as pd

index = pd.date_range(datetime.datetime.now().date(), periods=15, freq='D')
df=pd.DataFrame({'mad':[4.267442387,
 4.141153321,
 3.7710860489999996,
 3.242694515,
 2.7432170389999997,
 1.522047198,
 0.21278185100000002,
 1.4125138019999999,
 2.376126224,
 2.759065558,
 3.31686318,
 3.80235022,
 4.486731836000001,
 4.903355638,
 5.0984123619999995
 ],'run':[18,
 19,
 20,
 21,
 22,
 23,
 0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8],'required result':[-0.0013727079999999998,
 -0.006719040999999999,
 -0.018839316000000002,
 -0.026058612000000002,
 -0.023888003999999997,
 -0.05413295,
 0.0,
 1.199731951,
 1.0816721870000001,
 0.48820384,
 0.26150036600000004,
 0.149397481,
 0.13896318300000002,
 0.079369569,
 0.034303287]},index=index)

Column df['mom_1'] should have a value based on three conditions.
If the value in df['run'] is 0, then df['mom_1'] should be 0.
If df['run'] is 1, then df['mom_1'] should be df['mad'] - df['mad'].shift().
The problem arises with the third condition:
if df['run'] > 1, then df['mom_1'] should be (df['mad'] - df['mad'].shift() + df['mom_1'].shift()) / df['run'], so each value depends on the previous value of the same column. I cannot get the correct value.

I have tried this:

# create a new column 'mom_1'
df['mom_1'] = 0
# use np.select for the three conditions
df['mom_1'] = np.select(
    [df.run == 0, df.run == 1, df.run > 1],
    [0, df['mad'] - df['mad'].shift(),
     (df['mad'] - df['mad'].shift() + df['mom_1'].shift()) / df['run']],
    default=0)

The df['mom_1'] values are correct only when df['run'] == 0 or 1 but I am getting incorrect values whenever df['run'] > 1.
I would be grateful for any guidance.

A generator is a common way to express a recursive relation like this one, because each value is computed from the one produced just before it.
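
To see why the np.select attempt in the question cannot work: every choice array is computed in full before np.select picks from them, so df['mom_1'].shift() only ever sees the zeros the column was initialised with. A minimal sketch using the first four rows with run = 0, 1, 2, 3 (the names diff, prev_mom, vectorised and recursive are mine, for illustration only):

```python
import numpy as np
import pandas as pd

# First four rows of the data (run = 0, 1, 2, 3)
df = pd.DataFrame({
    "mad": [0.212781851, 1.412513802, 2.376126224, 2.759065558],
    "run": [0, 1, 2, 3],
})
df["mom_1"] = 0

diff = df["mad"] - df["mad"].shift()
# Evaluated once, up front: zeros (NaN in the first row),
# not the values np.select is about to produce.
prev_mom = df["mom_1"].shift()

vectorised = np.select(
    [df["run"] == 0, df["run"] == 1, df["run"] > 1],
    [0, diff, (diff + prev_mom) / df["run"]],
    default=0,
)

# The same formula applied row by row, feeding each result into the next
recursive = [0.0]
for i in range(1, len(df)):
    recursive.append((diff.iloc[i] + recursive[-1]) / df["run"].iloc[i])

print(vectorised)  # rows with run > 1 are computed from a previous mom_1 of 0
print(recursive)   # rows with run > 1 build on the previous mom_1
```

The two agree for run == 0 and run == 1 and diverge from run == 2 onwards, which matches the behaviour described in the question.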

Data

Only the 9 periods of the run that starts at run = 0 (run = 0, 1, ..., 8) were kept for logical consistency.

df = pd.DataFrame({
    'mad': [0.21278185100000002, 1.4125138019999999, 2.376126224,
            2.759065558, 3.31686318, 3.80235022,
            4.486731836000001, 4.903355638, 5.0984123619999995],
    'run': range(9),
    'required result': [0.0, 1.199731951, 1.0816721870000001,
                        0.48820384, 0.26150036600000004, 0.149397481,
                        0.13896318300000002, 0.079369569, 0.034303287]
}, index=pd.date_range(datetime.datetime.now().date(), periods=9, freq='D'))

Code

def gen(max_run, mad, a0=0):
    yield a0
    ans_prev = a0
    n = 1
    while n <= max_run:
        # this is the recursive formula
        ans = (mad[n] - mad[n-1] + ans_prev) / n
        yield ans
        # increment for the next round
        ans_prev = ans
        n += 1

df["mom_1"] = list(gen(df["run"].max(), df["mad"].values, a0=0))
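
The generator above relies on run counting 0, 1, 2, ... in step with the row position. If that does not hold, for example when the frame contains several runs back to back as in the question's original 15 rows, the divisor and the reset can be read from the run column instead. A rough sketch under that assumption; recursive_mom and its start parameter are hypothetical names, and using start for a first row whose run is not 0 is my own choice, since that row has no predecessor to difference against:

```python
import numpy as np

def recursive_mom(mad, run, start=0.0):
    """Hypothetical helper: 0 where run == 0, otherwise
    (mad[i] - mad[i-1] + mom[i-1]) / run[i]."""
    mad = np.asarray(mad, dtype=float)
    run = np.asarray(run)
    mom = np.empty(len(mad), dtype=float)
    for i in range(len(mad)):
        if run[i] == 0:
            mom[i] = 0.0      # a new run starts here
        elif i == 0:
            mom[i] = start    # assumption: no earlier row is available
        else:
            mom[i] = (mad[i] - mad[i - 1] + mom[i - 1]) / run[i]
    return mom

# The 9 rows used in this answer
mad = [0.21278185100000002, 1.4125138019999999, 2.376126224,
       2.759065558, 3.31686318, 3.80235022,
       4.486731836000001, 4.903355638, 5.0984123619999995]
run = list(range(9))
required = [0.0, 1.199731951, 1.0816721870000001,
            0.48820384, 0.26150036600000004, 0.149397481,
            0.13896318300000002, 0.079369569, 0.034303287]

mom = recursive_mom(mad, run)
print(np.allclose(mom, required, atol=1e-6))
```

With the DataFrame from this answer, the call would be df["mom_1"] = recursive_mom(df["mad"], df["run"]).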

Result

print(df)

                 mad  run  required result     mom_1
2020-11-28  0.212782    0         0.000000  0.000000
2020-11-29  1.412514    1         1.199732  1.199732
2020-11-30  2.376126    2         1.081672  1.081672
2020-12-01  2.759066    3         0.488204  0.488204
2020-12-02  3.316863    4         0.261500  0.261500
2020-12-03  3.802350    5         0.149397  0.149397
2020-12-04  4.486732    6         0.138963  0.138963
2020-12-05  4.903356    7         0.079370  0.079370
2020-12-06  5.098412    8         0.034303  0.034303

You can simply break it down into two conditions:

df['mom_1'] = np.where(df['run'] == 1, df['mad']-df['mad'].shift(), (df['mad'] - df['mad'].shift() + df['mom_1'].shift())/df['run'])

df['mom_1'] = np.where(df['run'] == 0, 0, df['mom_1'])

df = pd.DataFrame({
    'mad': [0.21278185100000002, 1.4125138019999999, 2.376126224,
            2.759065558, 3.31686318, 3.80235022,
            4.486731836000001, 4.903355638, 5.0984123619999995],
    'run': range(9),
    'required result': [0.0, 1.199731951, 1.0816721870000001,
                        0.48820384, 0.26150036600000004, 0.149397481,
                        0.13896318300000002, 0.079369569, 0.034303287]
}, index=pd.date_range(datetime.datetime.now().date(), periods=9, freq='D'))

I added a couple of columns to get a cumulative count of rows.

df['count']=1
df['cumcount']=df['count'].cumsum()

and then used this:

mom = []
mad = list(df['mad'])
run = list(df['run'])
for index, row in df.iterrows():
    if row['run'] == 0:
        mom.append(0)
    elif row['run'] > 0:
        row_num = int(row['cumcount']-1)

        mom.append((mad[row_num]-mad[row_num-1]+mom[row_num-1])/int(run[row_num]))
    else:
        raise ValueError("Column 'run' contains negative values")
df['mom']=mom

which seems to have worked.

print(df)

                 mad  run  required result  count  cumcount       mom
2020-12-02  0.212782    0         0.000000      1         1  0.000000
2020-12-03  1.412514    1         1.199732      1         2  1.199732
2020-12-04  2.376126    2         1.081672      1         3  1.081672
2020-12-05  2.759066    3         0.488204      1         4  0.488204
2020-12-06  3.316863    4         0.261500      1         5  0.261500
2020-12-07  3.802350    5         0.149397      1         6  0.149397
2020-12-08  4.486732    6         0.138963      1         7  0.138963
2020-12-09  4.903356    7         0.079370      1         8  0.079370
2020-12-10  5.098412    8         0.034303      1         9  0.034303

I have read a lot of opinions that seem to discourage this type of iteration with pandas DataFrames, so I would be interested if anyone has an alternative.
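
One possible alternative that avoids both iterrows() and the helper columns is itertools.accumulate from the standard library, which carries the previous result into each step. This is only a sketch: step and mom_alt are names made up for illustration, and it assumes, as in the loop above, that the value is 0 wherever run is 0.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({
    'mad': [0.21278185100000002, 1.4125138019999999, 2.376126224,
            2.759065558, 3.31686318, 3.80235022,
            4.486731836000001, 4.903355638, 5.0984123619999995],
    'run': range(9),
})

def step(prev_mom, row):
    """Hypothetical helper: next value from the previous one and (diff, run)."""
    diff, run = row
    return 0.0 if run == 0 else (diff + prev_mom) / run

# diff() gives mad[i] - mad[i-1]; the first row has no predecessor, so use 0
diffs = df['mad'].diff().fillna(0.0)
# initial=0.0 seeds the recursion and shows up as an extra first element,
# which the [1:] slice drops again
df['mom_alt'] = list(accumulate(zip(diffs, df['run']), step, initial=0.0))[1:]
print(df)
```

This is still a Python-level loop under the hood, so it is a tidier spelling of the same recursion rather than a vectorised speed-up.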
