Is there a way in Pandas to use previous row value to compute new values for a row
python - use previous row's value to update the new rows values
This is the current DataFrame:
> ID Date current
> 2001980 10/30/2017 1
> 2001980 10/29/2017 0
> 2001980 10/28/2017 0
> 2001980 10/27/2017 40
> 2001980 10/26/2017 39
> 2001980 10/25/2017 0
> 2001980 10/24/2017 0
> 2001980 10/23/2017 60
> 2001980 10/22/2017 0
> 2001980 10/21/2017 0
> 2002222 10/21/2017 0
> 2002222 10/20/2017 0
> 2002222 10/19/2017 16
> 2002222 10/18/2017 0
> 2002222 10/17/2017 0
> 2002222 10/16/2017 20
> 2002222 10/15/2017 19
> 2002222 10/14/2017 18
Below is the final DataFrame. The expected column is what I want.
Thank you very much.
> ID Date current expected
> 2001980 10/30/2017 1 1
> 2001980 10/29/2017 0 0
> 2001980 10/28/2017 0 0
> 2001980 10/27/2017 40 40
> 2001980 10/26/2017 39 39
> 2001980 10/25/2017 0 38
> 2001980 10/24/2017 0 37
> 2001980 10/23/2017 60 60
> 2001980 10/22/2017 0 59
> 2001980 10/21/2017 0 58
> 2002222 10/21/2017 0 0
> 2002222 10/20/2017 0 0
> 2002222 10/19/2017 16 16
> 2002222 10/18/2017 0 15
> 2002222 10/17/2017 0 14
> 2002222 10/16/2017 20 20
> 2002222 10/15/2017 19 19
> 2002222 10/14/2017 18 18
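In Excel I am using the following formula:
= if(this row's ID = previous row's ID, max(previous row's expected value - 1, this row's current value), this row's current value)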
A simpler revision:
df['expected'] = df.groupby(['ID',df.current.ne(0).cumsum()])['current']\
.transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0,np.inf)
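To see why grouping on `df.current.ne(0).cumsum()` works, it can help to look at that key by itself. The sketch below rebuilds the sample data from the question without the Date column; the `block` column name is an assumption added only for illustration. Every non-zero `current` value starts a new block, and the zeros that follow count down from the block's first value.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [2001980] * 10 + [2002222] * 8,
    "current": [1, 0, 0, 40, 39, 0, 0, 60, 0, 0,
                0, 0, 16, 0, 0, 20, 19, 18],
})

# Hypothetical helper column: the counter increments at every non-zero value,
# so a non-zero row and the zeros after it share one block number.
df["block"] = df["current"].ne(0).cumsum()

# Within each (ID, block) group, subtract the running count of zeros
# from the group's first value, then floor the result at zero.
df["expected"] = (
    df.groupby(["ID", "block"])["current"]
      .transform(lambda x: x.iloc[0] - x.eq(0).cumsum())
      .clip(lower=0)
)
print(df)
```

Note that this relies on every non-zero `current` value restarting the countdown, which holds for the sample data; the Excel formula would instead keep the larger of the two values.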
Let's have a little fun:
df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)
Output:
ID Date current expected
0 2001980 10/30/2017 1 1
1 2001980 10/29/2017 0 0
2 2001980 10/28/2017 0 0
3 2001980 10/27/2017 40 40
4 2001980 10/26/2017 39 39
5 2001980 10/25/2017 0 38
6 2001980 10/24/2017 0 37
7 2001980 10/23/2017 60 60
8 2001980 10/22/2017 0 59
9 2001980 10/21/2017 0 58
10 2002222 10/21/2017 0 0
11 2002222 10/20/2017 0 0
12 2002222 10/19/2017 16 16
13 2002222 10/18/2017 0 15
14 2002222 10/17/2017 0 14
15 2002222 10/16/2017 20 20
16 2002222 10/15/2017 19 19
17 2002222 10/14/2017 18 18
# Let's calculate two series. First, a series that fills the zeros within an 'ID' with the previous non-zero value
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1
Output:
0 1.0
1 1.0
2 1.0
3 40.0
4 39.0
5 39.0
6 39.0
7 60.0
8 60.0
9 60.0
10 NaN
11 NaN
12 16.0
13 16.0
14 16.0
15 20.0
16 19.0
17 18.0
Name: current, dtype: float64
# The second series is a negated cumulative count of zeros within each group of 'ID' and non-zero block
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2
Output:
0 0
1 -1
2 -2
3 0
4 0
5 -1
6 -2
7 0
8 -1
9 -2
10 -1
11 -2
12 0
13 -1
14 -2
15 0
16 0
17 0
Name: current, dtype: int32
(s1 + s2).clip(0, np.inf).fillna(0)
Output:
0 1.0
1 0.0
2 0.0
3 40.0
4 39.0
5 38.0
6 37.0
7 60.0
8 59.0
9 58.0
10 0.0
11 0.0
12 16.0
13 15.0
14 14.0
15 20.0
16 19.0
17 18.0
Name: current, dtype: float64
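The roles of `clip` and `fillna` in the last step are easier to see on a tiny case. A minimal sketch, assuming a hypothetical single-ID series with a leading zero and a run of zeros longer than the value before it; the names `block` and `raw` are made up for illustration.

```python
import pandas as pd

# Hypothetical data for one ID: leading zero, then 2 followed by three zeros.
current = pd.Series([0, 2, 0, 0, 0])

# s1: zeros replaced by the previous non-zero value (NaN where none exists yet).
s1 = current.where(current.ne(0)).ffill()

# s2: negated running count of zeros inside each non-zero block.
block = current.ne(0).cumsum()
s2 = current.eq(0).astype(int).groupby(block).cumsum().mul(-1)

raw = s1 + s2  # NaN for the leading zero, negative once the countdown passes 0
result = raw.clip(lower=0).fillna(0).astype(int)
print(raw.tolist())
print(result.tolist())
```

Here `clip` removes the negative tail of the countdown and `fillna(0)` covers zeros that appear before any non-zero value.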
So you can do this using apply and nested functions.
import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]
df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})
Then create a function that updates the frame.
Python 3.X
def update_frame(df):
    last_expected = None
    def apply_logic(row):
        nonlocal last_expected
        last_row_id = row.name - 1
        if row.name == 0:
            last_expected = row["current"]
            return last_expected
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        last_expected = max(last_expected-1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return last_expected
    return apply_logic
Python 2.X
def update_frame(df):
    sd = {"last_expected": None}
    def apply_logic(row):
        last_row_id = row.name - 1
        if row.name == 0:
            sd['last_expected'] = row["current"]
            return sd['last_expected']
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        sd['last_expected'] = max(sd['last_expected'] - 1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return sd['last_expected']
    return apply_logic
And run the function as follows:
df['expected'] = df.apply(update_frame(df), axis=1)
The output is as expected.
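You can use a conditional statement together with .shift(), which fetches the previous row, and np.where. As far as I know, this avoids relying on the kind of loops mentioned in the comments: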
df['test'] = np.where(df['current'].shift() <
df['current'], df['current'] - 1, df['current'])
The result is below (I added a 'test' column; you can rename it to 'expected' if you prefer):
>>> df
         ID        Date  current  expected  test
0   2001980  10/30/2017        1         1     1
1   2001980  10/29/2017        0         0     0
2   2001980  10/28/2017        0         0     0
3   2001980  10/27/2017       40        40    39
4   2001980  10/26/2017       39        39    39
5   2001980  10/25/2017       38        38    38
6   2001980  10/24/2017       37        37    37
7   2001980  10/18/2017        0        36     0
8   2001980  10/17/2017        0        35     0
9   2001980  10/16/2017       60        60    59
10  2001980  10/15/2017        0        59     0
11  2001980  10/14/2017        0        58     0
12  2001980  10/13/2017        0        57     0
13  2001980  10/12/2017        0        56     0
14  2002222  10/21/2017        0         0     0
15  2002222  10/20/2017        0         0     0
16  2002222  10/19/2017       16        16    15
17  2002222  10/18/2017        0        15     0
18  2002222  10/17/2017        0        14     0
19  2002222  10/16/2017       13        13    12
20  2002222  10/15/2017       12        12    12
21  2002222  10/14/2017       11        11    11
22  2002222  10/13/2017       10        10    10
23  2002222  10/12/2017        9         9     9
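In the table above, 'test' and 'expected' disagree on many rows. One likely reason is that `.shift()` only exposes the previous row's `current` value, while the Excel formula in the question needs the previous row's already computed expected value. A minimal sketch on a hypothetical four-row, single-ID series makes the difference visible; the reference loop is an illustration of the formula, not code from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data for one ID: a single non-zero value followed by zeros.
current = pd.Series([40, 0, 0, 0])

# shift()-based rule from the answer above: compares against the previous `current`.
test = np.where(current.shift() < current, current - 1, current)

# Reference loop: each step depends on the previous *expected* value.
expected = []
for value in current.tolist():
    previous = expected[-1] - 1 if expected else value
    expected.append(max(previous, value))

print(test.tolist())
print(expected)
```

Because each expected value feeds into the next one, a single `shift()` comparison cannot carry the countdown across several consecutive zeros.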
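Edit: addressing the OP's concern about scaling to millions of rows.
Yes, my original answer does not scale to very large DataFrames. With a few small edits, however, this easy-to-read solution does scale. The optimization below uses the JIT compiler in Numba. After importing Numba, I added the jit decorator and modified the function to operate on NumPy arrays instead of pandas objects. Numba is NumPy-aware but cannot optimize pandas objects.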
import numba
import numpy as np

@numba.jit
def expected(id_col, current_col):
    lexp = []
    lstID = 0
    expected = 0
    for i in range(len(id_col)):
        id, current = id_col[i], current_col[i]
        if id == lstID:
            expected = max(current, max(expected - 1, 0))
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return np.array(lexp)
To pass NumPy arrays to the function, use the .values attribute of the pandas Series.
df1['expected'] = expected(df1.ID.values, df1.current.values)
To test performance, I expanded the original DataFrame to more than 1 million rows.
df1 = df
while len(df1) < 1000000:
    df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)
The new version performs very well.
%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df1.shape
Out[65]: (1179648, 4)
df1.tail(15)
Out[66]:
ID Date current expected
1179633 2001980 10/27/2017 40 40
1179634 2001980 10/26/2017 39 39
1179635 2001980 10/25/2017 0 38
1179636 2001980 10/24/2017 0 37
1179637 2001980 10/23/2017 60 60
1179638 2001980 10/22/2017 0 59
1179639 2001980 10/21/2017 0 58
1179640 2002222 10/21/2017 0 0
1179641 2002222 10/20/2017 0 0
1179642 2002222 10/19/2017 16 16
1179643 2002222 10/18/2017 0 15
1179644 2002222 10/17/2017 0 14
1179645 2002222 10/16/2017 20 20
1179646 2002222 10/15/2017 19 19
1179647 2002222 10/14/2017 18 18
Original answer
A bit brute force, but easy to follow.
def expected(df):
    lexp = []
    lstID = None
    expected = 0
    for i in range(len(df)):
        id, current = df[['ID', 'current']].iloc[i]
        if id == lstID:
            expected = max(expected - 1, 0)
            expected = max(current, expected)
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return pd.Series(lexp)
This yields:
df['expected'] = expected(df)
df
Out[53]:
ID Date current expected
0 2001980 10/30/2017 1 1
1 2001980 10/29/2017 0 0
2 2001980 10/28/2017 0 0
3 2001980 10/27/2017 40 40
4 2001980 10/26/2017 39 39
5 2001980 10/25/2017 0 38
6 2001980 10/24/2017 0 37
7 2001980 10/23/2017 60 60
8 2001980 10/22/2017 0 59
9 2001980 10/21/2017 0 58
10 2002222 10/21/2017 0 0
11 2002222 10/20/2017 0 0
12 2002222 10/19/2017 16 16
13 2002222 10/18/2017 0 15
14 2002222 10/17/2017 0 14
15 2002222 10/16/2017 20 20
16 2002222 10/15/2017 19 19
17 2002222 10/14/2017 18 18
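I believe @Tarun Lalwani pointed in the right direction, which is to keep some key information outside the DataFrame. The code can be simplified, though, and there is nothing wrong with using a global variable as long as names are managed properly. It is a design pattern that often makes things simpler and improves readability.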
cached_last = { 'expected': None, 'ID': None }
def set_expected(x):
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected = max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected
df['expected'] = df.apply(set_expected, axis=1)
Note the potential side effects of the apply function, as described in the documentation for pandas.DataFrame.apply:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
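A related point is that a module-level cache also survives between separate calls to `df.apply`, so a second run can start from stale state. A minimal sketch of one way to limit this, assuming a hypothetical factory named `make_set_expected` that builds a fresh cache for every run; the three-row frame is made-up data for illustration.

```python
import pandas as pd

def make_set_expected():
    # Hypothetical factory: each call returns a function with its own cache,
    # so no state leaks from one apply() run into the next.
    cached_last = {"expected": None, "ID": None}

    def set_expected(x):
        if cached_last["ID"] is None or x.ID != cached_last["ID"]:
            expected = x.current
        else:
            expected = max(cached_last["expected"] - 1, x.current)
        cached_last["ID"] = x.ID
        cached_last["expected"] = expected
        return expected

    return set_expected

# Made-up single-ID frame whose last row is non-zero.
df = pd.DataFrame({"ID": [1, 1, 1], "current": [0, 0, 5]})

first = df.apply(make_set_expected(), axis=1)
second = df.apply(make_set_expected(), axis=1)  # fresh cache, same result
print(first.tolist(), second.tolist())
```

With a single shared cache, the second run on this frame would begin with the ID and value left over from the last row of the first run, so its first row would be computed as max(5 - 1, 0) rather than 0.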
The logic here should work:
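lst=[]
for _, y in df.groupby('ID'):
    z=[]
    for i,(_, x) in enumerate(y.iterrows()):
        if x['current'] > 0:
            # a non-zero value restarts the countdown
            z.append(x['current'])
        else:
            try:
                # continue the countdown from the previous result, floored at 0
                z.append(max(z[i-1]-1,0))
            except IndexError:
                # first row of the group has no previous result
                z.append(0)
    lst.extend(z)
lst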
Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]
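For comparison, the recurrence in the question's Excel formula, expected[i] = max(expected[i-1] - 1, current[i]), can be unrolled within each ID to expected[i] = max over j <= i of (current[j] + j) - i. The sketch below is an alternative that does not come from any of the answers above. It assumes the rows of each ID are contiguous and already in the desired order, and the names `pos` and `shifted` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [2001980] * 10 + [2002222] * 8,
    "current": [1, 0, 0, 40, 39, 0, 0, 60, 0, 0,
                0, 0, 16, 0, 0, 20, 19, 18],
})

# Position of each row inside its ID group: 0, 1, 2, ...
pos = df.groupby("ID").cumcount()

# Adding the position turns "decay by 1 per row" into a plain running maximum.
shifted = df["current"] + pos
df["expected"] = shifted.groupby(df["ID"]).cummax() - pos
print(df["expected"].tolist())
```

Since `current` is never negative in the sample data, the result is never negative either, so no extra clipping step is used here.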