Say I have a time series data as below.
df
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 Nan 34.23
4 32 Nan
5 18.75 41.1
6 Nan 45.12
7 23 39.67
8 Nan 36.45
9 36 Nan
Now I want to fill NaNs in column priceA by taking mean of previous N values in the column. In this case take N=3. And for column priceB I have to fill Nan by value M rows above(current index-M).
I tried to write for loop for it which is not a good practice as my data is too large. Is there a better way to do this?
N=3
M=2
def fillPriceA( df,indexval,n):
temp=[ ]
for i in range(n):
if i < 0:
continue
temp.append(df.loc[indexval-(i+1), 'priceA'])
return np.nanmean(np.array(temp, dtype=np.float))
def fillPriceB(df, indexval, m):
return df.loc[indexval-m, 'priceB']
for idx, rows for df.iterrows():
if idx< N:
continue
else:
if rows['priceA']==None:
rows['priceA']= fillPriceA(df, idx,N)
if rows['priceB']==None:
rows['priceB']=fillPrriceB(df,idx,M)
Expected output:
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 32.31 34.23
4 32 29.08
5 18.75 41.1
6 27.68 45.12
7 23 39.67
8 23.14 36.45
9 36 39.67
A solution could be to only work with the nan
index (see dataframe boolean indexing ):
param = dict(priceA = 3, priceB = 2) #Number of previous values to consider
for col in df.columns:
for i in df[np.isnan(df[col])].index: #Iterate over nan index
_window = df.iloc[max(0,(i-param[col])):i][col] #get the nth expected elements
df.loc[i][col] = _window.mean() if col == 'priceA' else _window.iloc[0] #Replace with right method
print(df)
Result:
priceA priceB
0 25.670000 30.56
1 34.120000 28.43
2 37.140000 29.08
3 32.310000 34.23
4 32.000000 29.08
5 18.750000 41.10
6 27.686667 45.12
7 23.000000 39.67
8 23.145556 36.45
9 36.000000 39.67
Note
1. Using np.isnan()
implies that your columns are numeric. If not convert your columns before with pd.to_numeric()
:
...
for col in df.columns:
df[col] = pd.to_numeric(df[col], errors = 'coerce')
...
Or use pd.isnull()
instead (see example below). Be aware of the performances ( numpy
is faster):
from random import randint
#A sample with 10k elements and some np.nan
arr = np.random.rand(10000)
for i in range(100):
arr[randint(0,9999)] = np.nan
#Performances
%timeit pd.isnull(arr)
10000 loops, best of 3: 24.8 µs per loop
%timeit np.isnan(arr)
100000 loops, best of 3: 5.6 µs per loop
2. A more generic alternative could be to define methods and window size to apply for each column in a dict
:
import pandas as pd
param = {}
param['priceA'] = {'n':3,
'method':lambda x: pd.isnull(x)}
param['priceB'] = {'n':2,
'method':lambda x: x[0]}
param
contains now n
the number of elements and method
a lambda expression. Accordingly rewrite your loops:
for col in df.columns:
for i in df[np.isnan(df[col])].index: #Iterate over nan index
_window = df.iloc[max(0,(i-param[col]['n'])):i][col] #get the nth expected elements
df.loc[i][col] = param[col]['method'](_window.values) #Replace with right method
print(df)#This leads to a similar result.
You can use an NA mask to do what you need per column:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3,4, None, 5, 6], 'b': [1, None, 2, 3, 4, None, 7]})
df
# a b
# 0 1.0 1.0
# 1 2.0 NaN
# 2 3.0 2.0
# 3 4.0 3.0
# 4 NaN 4.0
# 5 5.0 NaN
# 6 6.0 7.0
for col in df.columns:
s = df[col]
na_indices = s[s.isnull()].index.tolist()
prev = 0
for k in na_indices:
s[k] = np.mean(s[prev:k])
prev = k
df[col] = s
print(df)
a b
# 0 1.0 1.0
# 1 2.0 1.0
# 2 3.0 2.0
# 3 4.0 3.0
# 4 2.5 4.0
# 5 5.0 2.5
# 6 6.0 7.0
While this is still a custom operation, I am pretty sure it will be slightly faster because it is not iterating over each row, just over the NA values, which I am assuming will be sparse compared to the actual data
To fill priceA use rolling
, then shift
and use this result in fillna
,
# make some data
df = pd.DataFrame({'priceA': range(10)})
#make some rows missing
df.loc[[4, 6], 'priceA'] = np.nan
n = 3
df.priceA = df.priceA.fillna(df.priceA.rolling(n, min_periods=1).mean().shift(1))
The only edge case here is when two nans are within n
of one another but it seems to handle this as in your question.
For priceB just use shift
,
df = pd.DataFrame({'priceB': range(10)})
df.loc[[4, 8], 'priceB'] = np.nan
m = 2
df.priceB = df.priceB.fillna(df.priceB.shift(m))
Like before, there is the edge case where there is a nan exactly m
before another nan.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.