Pandas, apply function for each row based on previous rows
I have a DataFrame like this:
date compound_score negativity_score positive_score \
0 2017-12-10 0.361400 0.339500 0.311000
1 2017-12-11 0.639950 0.216000 0.476000
2 2017-12-12 0.554286 0.262000 0.464000
3 2017-12-13 0.715275 0.232250 0.423750
4 2017-12-14 0.760940 0.221600 0.476200
5 2017-12-15 0.503886 0.241429 0.391000
6 2017-12-16 0.372300 0.345333 0.356667
7 2017-12-17 0.700900 0.163000 0.458000
8 2017-12-18 0.369733 0.220667 0.364222
9 2017-12-19 0.176000 0.304000 0.362000
10 2017-12-20 0.474322 0.262222 0.426778
11 2017-12-21 0.623620 0.224000 0.435200
12 2017-12-22 0.488125 0.211375 0.438000
13 2017-12-23 0.226900 0.121500 0.341500
14 2017-12-24 0.461800 0.233000 0.545000
15 2017-12-25 0.686040 0.270800 0.458600
16 2017-12-26 0.760525 0.212750 0.527250
17 2017-12-27 0.627575 0.122250 0.463500
18 2017-12-28 0.579173 0.210182 0.381909
19 2017-12-29 0.378815 0.239000 0.339846
20 2017-12-30 0.428200 0.328000 0.349000
21 2017-12-31 -0.116800 0.507000 0.295000
22 2018-01-01 0.515433 0.315000 0.417000
23 2018-01-02 0.380250 0.298250 0.366250
24 2018-01-03 0.609657 0.277000 0.458714
25 2018-01-04 0.751067 0.251667 0.465000
26 2018-01-05 0.207000 0.255750 0.324500
27 2018-01-06 0.853200 0.127000 0.253000
28 2018-01-07 0.506800 0.284500 0.350500
29 2018-01-08 0.499586 0.262571 0.388571
neutral_score compound_diff consecutive_compound
0 0.349500 NaN 0
1 0.308000 0.278550 1
2 0.274143 -0.085664 0
3 0.344000 0.160989 1
4 0.302200 0.045665 1
5 0.367429 -0.257054 0
6 0.298000 -0.131586 0
7 0.379000 0.328600 1
8 0.415111 -0.331167 0
9 0.333800 -0.193733 0
10 0.311000 0.298322 1
11 0.340800 0.149298 1
12 0.350375 -0.135495 0
13 0.537500 -0.261225 0
14 0.222000 0.234900 1
15 0.270800 0.224240 1
16 0.260000 0.074485 1
17 0.414000 -0.132950 0
18 0.407909 -0.048402 0
19 0.420923 -0.200357 0
20 0.323000 0.049385 1
21 0.197000 -0.545000 0
22 0.268000 0.632233 1
23 0.335250 -0.135183 0
24 0.264429 0.229407 1
....
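I want to apply a calculation to the DataFrame where the result for each row depends on the 14 rows before it.
I tried passing a shifted DataFrame from the row itself, but I don't quite understand how to hand the current row to a function and, inside that function, look back 14 days.
I tried the following; each attempt either returned NaN or raised an error: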
def get_up_down_pct_ratio(df):
    up_days_pct = df.loc[df[COMPOUND_DIFF] > 0, COMPOUND_DIFF].sum()
    fall_days_pct = df.loc[df[COMPOUND_DIFF] < 0, COMPOUND_DIFF].sum()
    total = up_days_pct + fall_days_pct
    return percent(up_days_pct, total)

d['up_down_ratio'] = d.apply(lambda x: get_up_down_pct_ratio(x.shift(14)), axis=1)
This just assigns NaN to the column.
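A likely reason for the NaN: with axis=1, apply hands the function one row at a time as a Series indexed by column name, so x.shift(14) moves values across columns instead of going back 14 rows. A minimal sketch of that effect, using a made-up two-column frame (toy and its values are assumptions, not the data above):

```python
import pandas as pd

# Toy frame with made-up values, only to illustrate what apply(axis=1) passes in.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Each row arrives as a Series whose index is the column names.
first_row_index = toy.iloc[0].index.tolist()
print(first_row_index)

# Shifting a 2-element row by 14 positions pushes every value out of range,
# so nothing but NaN is left for the function to work with.
all_nan_after_shift = toy.apply(lambda x: x.shift(14).isna().all(), axis=1)
print(all_nan_after_shift)
```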
def get_up_down_pct_ratio(row):
    up_days_pct = row[row['compound_diff'] > 0, 'compound_diff'].sum()
    fall_days_pct = row[row['compound_diff'] > 0, 'compound_diff'].sum()
    total = up_days_pct + fall_days_pct
    return percent(up_days_pct, total)

a['up_down_pct_ration'] = a.apply(lambda row: get_up_down_pct_ratio(row))
The error raised:
ValueError: key of type tuple not found and not a MultiIndex
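The message suggests that a Series was indexed with a tuple: row[mask, 'compound_diff'] is a two-part key of the kind .loc accepts, but it is written without .loc and applied to a single row or column rather than a frame. A sketch of the working form on a small DataFrame window (window and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical window of rows; the values are invented for the example.
window = pd.DataFrame({"compound_diff": [0.2, -0.1, 0.3, -0.4]})

# A row mask plus a column label belongs inside .loc on a DataFrame.
up_sum = window.loc[window["compound_diff"] > 0, "compound_diff"].sum()
fall_sum = window.loc[window["compound_diff"] < 0, "compound_diff"].sum()
print(up_sum, fall_sum)
```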
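A few things to note.
Below is a different approach: create a class that accumulates values over a rolling 14-day window and handles both cases, UP days and FALL days.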
class accumulate(object):
    def __init__(self):
        # Rolling buffer of the last 14 values, initially all zero
        self.accumList = [0 for n in range(14)]

    def newDate(self, v, up=True):
        # Drop the oldest value by shifting everything one slot to the left
        self.accumList[0:13] = self.accumList[1:]
        v = float(v)
        if (v+0.0) != v:
            # remove NaN (NaN never compares equal to itself)
            v = 0.0
        elif up and (v < 0):
            # tracking values > 0, so ignore negative values
            v = 0.0
        elif (not up) and (v > 0):
            # tracking values < 0, so ignore positive values
            v = 0.0
        self.accumList[13] = v
        return sum(self.accumList)

a = accumulate()
df['up'] = df.apply(lambda r: a.newDate(r.compound_diff), axis=1)
a = accumulate()  # restart rolling amounts
df['fall'] = df.apply(lambda r: a.newDate(r.compound_diff, up=False), axis=1)
df['pct'] = df.up / (df.up + df.fall)
df.head()
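The same running sums can probably be obtained without a stateful class, using pandas' built-in rolling windows. This is only a sketch of an alternative, not the answer's method: it assumes NaN should count as 0 and that a window may be shorter than 14 rows at the start (min_periods=1), which mirrors the zero-filled buffer above. The frame holds just the first few compound_diff values from the question.

```python
import numpy as np
import pandas as pd

# First rows of compound_diff from the question; the remaining rows are omitted.
df = pd.DataFrame(
    {"compound_diff": [np.nan, 0.278550, -0.085664, 0.160989, 0.045665, -0.257054]}
)

WINDOW = 14
diff = df["compound_diff"].fillna(0.0)  # treat NaN as 0, like the class does

# clip(lower=0) keeps only gains, clip(upper=0) keeps only losses;
# rolling(...).sum() then adds up the current row and up to 13 rows before it.
df["up"] = diff.clip(lower=0).rolling(WINDOW, min_periods=1).sum()
df["fall"] = diff.clip(upper=0).rolling(WINDOW, min_periods=1).sum()
df["pct"] = df["up"] / (df["up"] + df["fall"])
print(df)
```

As with the class, the window here includes the current row, and pct is NaN wherever up + fall is zero.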
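@frankr6591's answer didn't get me all the way to what I needed, but it did point me in the right direction.
I need to apply this logic to the DataFrame in several ways, so I wrote a simpler, more generic function. It still needs optimizing, but for now it works well with the different columns passed to it: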
def calculate_two_weeks_data(new_col_name, col_to_run_on):
    # Note: operates on the module-level DataFrame `df`

    def calculate_ratio_value(row, df_, col):
        index = row['index']
        start_idx = index - 14
        if start_idx < 0:
            # Fewer than 14 previous rows available
            return None
        else:
            # The 14 rows before the current one (current row excluded)
            prev_rows = df_.iloc[start_idx:index]
            col_to_list = prev_rows[col].tolist()
            up_values = 0
            down_values = 0
            for value in col_to_list:
                if value > 0:
                    up_values += value
                else:
                    down_values += value
            up_ratio = up_values / (up_values + down_values)
            return up_ratio

    # reset_index adds the 'index' column that calculate_ratio_value reads
    df.reset_index(inplace=True)
    df[new_col_name] = df.apply(calculate_ratio_value, args=[df, col_to_run_on], axis=1)
    df.dropna(inplace=True)
    return df
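One possible optimization is to replace the row-by-row apply with shift and rolling. The sketch below is an assumption about how that could look, not part of the answer above: add_prev_window_ratio is a hypothetical helper, and the demo values and window=3 are made up to keep the example short.

```python
import numpy as np
import pandas as pd

def add_prev_window_ratio(frame, new_col_name, col_to_run_on, window=14):
    """Hypothetical helper: up / (up + down) over the `window` rows before each row."""
    prev = frame[col_to_run_on].shift(1)           # shift(1) excludes the current row
    up = prev.clip(lower=0).rolling(window).sum()  # sum of gains in the window
    down = prev.clip(upper=0).rolling(window).sum()  # sum of losses in the window
    out = frame.copy()                             # leave the caller's frame untouched
    out[new_col_name] = up / (up + down)
    return out

# Made-up values with a 3-row window so the result is easy to check by hand.
demo = pd.DataFrame({"compound_diff": [0.1, -0.2, 0.3, 0.4, -0.1]})
result = add_prev_window_ratio(demo, "up_ratio", "compound_diff", window=3)
print(result)
```

Rows without a full window of previous values come out as NaN, so calling dropna() on the result afterwards gives the same trimming as the function above. Unlike that function, this sketch returns a copy and does not call reset_index.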