Python time series lag by shift(1): how to fillna for the created NaN
I have a very large dataset containing ids and time series data points (with some missing values). The following is just an example.
I need to create a lag variable for each group, which of course creates a NaN for the first observation of each group. I would like to assign the next available value to that created NaN specifically, but leave the other missing values untouched for later manipulation.
id time value lag_value
A 2000 10 NaN # I want this to be 10, the next available value
A 2001 11 10
A 2002 NaN 11
A 2003 14 NaN
A 2004 10 14
Edit:
I think it would be cleaner to use first_valid_index to assign the next available value; see "Pandas - find first non-null value in column".
Here you go. This fills the first lag_value with the first non-NaN entry of the shifted column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'A'],
                   'time': [2000, 2001, 2002, 2003, 2004],
                   'value': [10, 11, np.nan, 14, 10]})
# Lag the series by one step; row 0 becomes NaN.
df['lag_value'] = df.value.shift(1)
# Overwrite only row 0 with the first non-NaN lagged value.
df.loc[0, 'lag_value'] = df.lag_value[df.lag_value.notnull()].values[0]
#   id  time  value  lag_value
# 0  A  2000   10.0       10.0
# 1  A  2001   11.0       10.0
# 2  A  2002    NaN       11.0
# 3  A  2003   14.0        NaN
# 4  A  2004   10.0       14.0
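The question mentions several groups, while the snippet above handles a single id. A minimal sketch of one way to extend it per group follows. Group B and its values are invented for illustration, and the approach (groupby with shift, transform('first') and cumcount) is an assumption about what fits the asker's data, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical two-group data; group B is made up for this sketch.
df = pd.DataFrame({
    'id': ['A'] * 5 + ['B'] * 4,
    'time': [2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003],
    'value': [10, 11, np.nan, 14, 10, np.nan, 7, 8, np.nan],
})

# Lag within each id so values never leak from one group into the next.
df['lag_value'] = df.groupby('id')['value'].shift(1)

# First non-NaN lagged value of each group, broadcast to all of its rows.
first_lag = df.groupby('id')['lag_value'].transform('first')

# Only the first row of each group holds the NaN created by shift(1).
is_first_row = df.groupby('id').cumcount() == 0
df.loc[is_first_row, 'lag_value'] = first_lag[is_first_row]
```

Only the first row of each group is overwritten, so NaNs that come from genuinely missing observations stay in place for later handling.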
Since you mention first_valid_index:
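s = df.value.shift()
# With the default 0..n-1 index, the position just before the first valid label is the NaN created by shift().
s.iloc[s.first_valid_index() - 1] = df.value.iloc[0]
s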
Out[110]:
0 10.0
1 10.0
2 11.0
3 NaN
4 14.0
Name: value, dtype: float64
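The arithmetic on first_valid_index() above relies on the default integer index, because first_valid_index() returns a label rather than a position. As a hedged variant, the sketch below assumes the series is indexed by year (an assumption, the original uses 0..4) and looks the value up by label instead. It fills with the first valid lagged value, which for this data is the same 10 that df.value.iloc[0] gives.

```python
import numpy as np
import pandas as pd

# Same values as the example, but indexed by year instead of 0..n-1 (assumed).
value = pd.Series([10, 11, np.nan, 14, 10],
                  index=[2000, 2001, 2002, 2003, 2004], name='value')

s = value.shift()

# first_valid_index() returns an index label, or None if every entry is NaN.
first_label = s.first_valid_index()
if first_label is not None:
    # Position 0 is the NaN introduced by shift(); fill it with the
    # first valid lagged value, fetched by label.
    s.iloc[0] = s.loc[first_label]
```

The later NaN, which comes from the missing 2002 observation, is left untouched.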