I am trying to move from Excel to Python, and pandas in particular, but I am fairly new to it.
So please forgive me if I have made some 'rookie' errors here.
I have a time series DataFrame, df, with columns 'mad' and 'run'. I have added another column 'required result' to show what the value of 'mom_1' should be.
I thought the problem might be the use of .shift(), but it does appear to work where df['run'] == 1.
import datetime
import numpy as np
import pandas as pd
index = pd.date_range(datetime.datetime.now().date(), periods=15, freq='D')
df=pd.DataFrame({'mad':[4.267442387,
4.141153321,
3.7710860489999996,
3.242694515,
2.7432170389999997,
1.522047198,
0.21278185100000002,
1.4125138019999999,
2.376126224,
2.759065558,
3.31686318,
3.80235022,
4.486731836000001,
4.903355638,
5.0984123619999995
],'run':[18,
19,
20,
21,
22,
23,
0,
1,
2,
3,
4,
5,
6,
7,
8],'required result':[-0.0013727079999999998,
-0.006719040999999999,
-0.018839316000000002,
-0.026058612000000002,
-0.023888003999999997,
-0.05413295,
0.0,
1.199731951,
1.0816721870000001,
0.48820384,
0.26150036600000004,
0.149397481,
0.13896318300000002,
0.079369569,
0.034303287]},index=index)
Column df['mom_1'] should have a value based on three conditions.
If the value in df['run'] is 0, then df['mom_1'] should be 0.
If the value in df['run'] is 1, then df['mom_1'] should be df['mad'] - df['mad'].shift().
The problem arises with the third condition: if df['run'] > 1, then df['mom_1'] should be (df['mad'] - df['mad'].shift() + df['mom_1'].shift())/df['run']. I cannot get the correct value.
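Written as a recurrence, the three conditions can be summarised as below. The symbols are my own shorthand and not part of the DataFrame: m_t stands for 'mom_1', a_t for 'mad' and r_t for 'run' on row t.

```latex
m_t =
\begin{cases}
0, & r_t = 0 \\
a_t - a_{t-1}, & r_t = 1 \\
\dfrac{a_t - a_{t-1} + m_{t-1}}{r_t}, & r_t > 1
\end{cases}
```

For example, on the row where 'run' is 2, (2.376126224 - 1.412513802 + 1.199731951) / 2 is approximately 1.081672, which agrees with the 'required result' column.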
I have tried this;
#create a new column 'mom_1'
df['mom_1']=0
#use np.select for conditions
df['mom_1']=np.select([df.run==0,df.run==1,df.run>1],[0,df['mad']-df['mad'].shift(),(df['mad']-df['mad'].shift()+df['mom_1'].shift())/df['run']],default=0)
The df['mom_1'] values are correct only when df['run'] == 0 or 1 but I am getting incorrect values whenever df['run'] > 1.
I would be grateful for any guidance.
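One likely explanation for the mismatch, offered as a sketch rather than a definitive diagnosis: Python evaluates every argument before np.select is called, so df['mom_1'].shift() is taken from the column as it exists at that moment (all zeros) and never from values computed for earlier rows. The shortened four-row frame and the names prev_mom and third_choice are only for illustration.

```python
import numpy as np
import pandas as pd

# Shortened illustrative data: the rows of the question where run is 0..3.
df = pd.DataFrame({
    "mad": [0.212781851, 1.412513802, 2.376126224, 2.759065558],
    "run": [0, 1, 2, 3],
})
df["mom_1"] = 0.0

# This is computed once, for all rows, from the current all-zero column.
prev_mom = df["mom_1"].shift()
third_choice = (df["mad"] - df["mad"].shift() + prev_mom) / df["run"]

df["mom_1"] = np.select(
    [df["run"] == 0, df["run"] == 1, df["run"] > 1],
    [0, df["mad"] - df["mad"].shift(), third_choice],
    default=0,
)

# The run == 2 row comes out as about 0.4818 instead of the required 1.0817,
# because the previous mom_1 it used was still 0.
print(df)
```

So the run == 0 and run == 1 rows are right (they do not depend on a previous 'mom_1'), while every row with run > 1 is missing the carried-over term.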
A generator is a common way to express a recursive relation.
Only the 9 periods with run = 0, 1, ..., 8 were kept for logical consistency.
df = pd.DataFrame({
'mad': [0.21278185100000002, 1.4125138019999999, 2.376126224,
2.759065558, 3.31686318, 3.80235022,
4.486731836000001, 4.903355638, 5.0984123619999995],
'run': range(9),
'required result': [0.0, 1.199731951, 1.0816721870000001,
0.48820384, 0.26150036600000004, 0.149397481,
0.13896318300000002, 0.079369569, 0.034303287]
}, index=pd.date_range(datetime.datetime.now().date(), periods=9, freq='D'))
def gen(max_run, mad, a0=0):
    yield a0
    ans_prev = a0
    n = 1
    while n <= max_run:
        # this is the recursive formula
        ans = (mad[n] - mad[n-1] + ans_prev) / n
        yield ans
        # increment for the next round
        ans_prev = ans
        n += 1
df["mom_1"] = list(gen(df["run"].max(), df["mad"].values, a0=0))
print(df)
                 mad  run  required result     mom_1
2020-11-28  0.212782    0         0.000000  0.000000
2020-11-29  1.412514    1         1.199732  1.199732
2020-11-30  2.376126    2         1.081672  1.081672
2020-12-01  2.759066    3         0.488204  0.488204
2020-12-02  3.316863    4         0.261500  0.261500
2020-12-03  3.802350    5         0.149397  0.149397
2020-12-04  4.486732    6         0.138963  0.138963
2020-12-05  4.903356    7         0.079370  0.079370
2020-12-06  5.098412    8         0.034303  0.034303
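Rather than comparing the two columns by eye, the match can also be checked numerically. The sketch below repeats the generator so that it runs on its own; the default integer index, the variable name matches and the tolerance of 1e-8 are my own choices, picked because 'required result' is only given to about nine decimal places.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mad": [0.21278185100000002, 1.4125138019999999, 2.376126224,
            2.759065558, 3.31686318, 3.80235022,
            4.486731836000001, 4.903355638, 5.0984123619999995],
    "run": range(9),
    "required result": [0.0, 1.199731951, 1.0816721870000001,
                        0.48820384, 0.26150036600000004, 0.149397481,
                        0.13896318300000002, 0.079369569, 0.034303287],
})

def gen(max_run, mad, a0=0):
    # Same recurrence as above: each value depends on the previous one.
    yield a0
    ans_prev = a0
    n = 1
    while n <= max_run:
        ans = (mad[n] - mad[n - 1] + ans_prev) / n
        yield ans
        ans_prev = ans
        n += 1

df["mom_1"] = list(gen(df["run"].max(), df["mad"].values, a0=0))

# Absolute tolerance only; rtol=0 keeps the comparison easy to reason about.
matches = np.allclose(df["mom_1"], df["required result"], rtol=0, atol=1e-8)
print(matches)
```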
You can simply break it down into two conditions:
df['mom_1'] = np.where(df['run'] == 1, df['mad']-df['mad'].shift(), (df['mad'] - df['mad'].shift() + df['mom_1'].shift())/df['run'])
df['mom_1'] = np.where(df['run'] == 0, 0, df['mom_1'])
df = pd.DataFrame({
'mad': [0.21278185100000002, 1.4125138019999999, 2.376126224,
2.759065558, 3.31686318, 3.80235022,
4.486731836000001, 4.903355638, 5.0984123619999995],
'run': range(9),
'required result': [0.0, 1.199731951, 1.0816721870000001,
0.48820384, 0.26150036600000004, 0.149397481,
0.13896318300000002, 0.079369569, 0.034303287]
}, index=pd.date_range(datetime.datetime.now().date(), periods=9, freq='D'))
I added a couple of columns to get a cumulative count of rows.
df['count']=1
df['cumcount']=df['count'].cumsum()
and then used this;
mom = []
mad = list(df['mad'])
run = list(df['run'])
for index, row in df.iterrows():
    if row['run'] == 0:
        mom.append(0)
    elif row['run'] > 0:
        row_num = int(row['cumcount']-1)
        mom.append((mad[row_num]-mad[row_num-1]+mom[row_num-1])/int(run[row_num]))
    else:
        raise ValueError("Index contains negative values")
df['mom']=mom
which seems to have worked.
print(df)
                 mad  run  required result  count  cumcount       mom
2020-12-02  0.212782    0         0.000000      1         1  0.000000
2020-12-03  1.412514    1         1.199732      1         2  1.199732
2020-12-04  2.376126    2         1.081672      1         3  1.081672
2020-12-05  2.759066    3         0.488204      1         4  0.488204
2020-12-06  3.316863    4         0.261500      1         5  0.261500
2020-12-07  3.802350    5         0.149397      1         6  0.149397
2020-12-08  4.486732    6         0.138963      1         7  0.138963
2020-12-09  4.903356    7         0.079370      1         8  0.079370
2020-12-10  5.098412    8         0.034303      1         9  0.034303
I have read a lot of opinions that seem to discourage this type of iteration with pandas DataFrames, so I would be interested if anyone has an alternative.
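One possible alternative, as a minimal sketch: because each value depends on the previous result, some kind of loop seems hard to avoid, but it can run over plain NumPy arrays instead of iterrows(), which also removes the need for the 'count' and 'cumcount' helper columns. The function name recursive_mom is hypothetical, and treating the first row as 0 when there is no previous row is my assumption.

```python
import numpy as np
import pandas as pd

def recursive_mom(mad, run):
    """Hypothetical helper: apply the recurrence row by row on NumPy arrays."""
    mad = np.asarray(mad, dtype=float)
    run = np.asarray(run)
    if (run < 0).any():
        raise ValueError("'run' contains negative values")
    # Assumption: the first row has no previous row, so it starts at 0.
    mom = np.zeros(len(mad))
    for i in range(1, len(mad)):
        if run[i] == 0:
            mom[i] = 0.0  # a run of 0 restarts the recurrence
        else:
            mom[i] = (mad[i] - mad[i - 1] + mom[i - 1]) / run[i]
    return mom

df = pd.DataFrame({
    "mad": [0.21278185100000002, 1.4125138019999999, 2.376126224,
            2.759065558, 3.31686318, 3.80235022,
            4.486731836000001, 4.903355638, 5.0984123619999995],
    "run": range(9),
    "required result": [0.0, 1.199731951, 1.0816721870000001,
                        0.48820384, 0.26150036600000004, 0.149397481,
                        0.13896318300000002, 0.079369569, 0.034303287],
})

df["mom"] = recursive_mom(df["mad"].to_numpy(), df["run"].to_numpy())
print(df)
```

The run == 1 case needs no separate branch here, because the previous value is 0 whenever the preceding row has run == 0 and dividing by 1 changes nothing.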
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.