How do I fill NaN values using the previous N values in a pandas column?

Question

Say I have time series data like the following.

 df
       priceA    priceB
  0     25.67    30.56
  1     34.12    28.43
  2     37.14    29.08
  3     NaN      34.23
  4     32       NaN
  5     18.75    41.1
  6     NaN      45.12
  7     23       39.67
  8     NaN      36.45
  9     36       NaN

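Now I want to fill the NaNs in column priceA with the mean of the previous N values in that column. In this case, take N = 3. For column priceB, I have to fill each NaN with the value M rows above it (current index - M).

I tried writing a for loop for this, which is not good practice because my data is very large. Is there a better way?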

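import numpy as np
import pandas as pd

N = 3
M = 2

def fillPriceA(df, indexval, n):
    # Collect the n values directly above indexval and average them
    temp = []
    for i in range(n):
        temp.append(df.loc[indexval - (i + 1), 'priceA'])
    return np.nanmean(np.array(temp, dtype=float))

def fillPriceB(df, indexval, m):
    # Value m rows above indexval
    return df.loc[indexval - m, 'priceB']

for idx, rows in df.iterrows():
    if idx < N:
        continue
    # NaN never compares equal to None, so test with pd.isnull();
    # write through df.loc because iterrows() yields copies of the rows
    if pd.isnull(rows['priceA']):
        df.loc[idx, 'priceA'] = fillPriceA(df, idx, N)
    if pd.isnull(rows['priceB']):
        df.loc[idx, 'priceB'] = fillPriceB(df, idx, M)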

Expected output:

       priceA    priceB
0      25.67     30.56
1      34.12     28.43
2      37.14     29.08
3      32.31     34.23
4      32        29.08
5      18.75     41.1
6      27.68     45.12
7      23        39.67
8      23.14     36.45
9      36        39.67

Answer 1

One solution is to work only on the NaN indices (see dataframe boolean indexing):

import numpy as np

param = dict(priceA=3, priceB=2)  # Number of previous values to consider

for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # Iterate over the NaN indices only
        _window = df.iloc[max(0, i - param[col]):i][col]  # The previous n rows
        # Mean of the window for priceA, value n rows above for priceB.
        # df.loc[i, col] avoids chained assignment, which may write to a copy.
        df.loc[i, col] = _window.mean() if col == 'priceA' else _window.iloc[0]

print(df)

Result:

      priceA  priceB
0  25.670000   30.56
1  34.120000   28.43
2  37.140000   29.08
3  32.310000   34.23
4  32.000000   29.08
5  18.750000   41.10
6  27.686667   45.12
7  23.000000   39.67
8  23.145556   36.45
9  36.000000   39.67

Notes
1. Using np.isnan() implies that your columns are numeric. If they are not, convert them first with pd.to_numeric():

...
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors = 'coerce')
    ...

Or use pd.isnull() instead (see the example below). Mind the performance, though (numpy is faster):

from random import randint

import numpy as np
import pandas as pd

#A sample with 10k elements and some np.nan
arr = np.random.rand(10000)
for i in range(100):
    arr[randint(0,9999)] = np.nan

#Performances
%timeit pd.isnull(arr)
10000 loops, best of 3: 24.8 µs per loop

%timeit np.isnan(arr)
100000 loops, best of 3: 5.6 µs per loop
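
To make the dtype point in note 1 concrete, here is a minimal sketch. It assumes a column that was read in as text; the sample values and the "missing" placeholder are made up for illustration. np.isnan() rejects object-dtype data, pd.isnull() accepts it but does not recognise a text placeholder as missing, and pd.to_numeric(..., errors='coerce') turns anything unparseable into a real NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical column stored as text, with a non-numeric placeholder
raw = pd.Series(["25.67", "34.12", "missing", "32"], dtype=object)

# np.isnan() only works on numeric dtypes and raises TypeError otherwise
try:
    np.isnan(raw.values)
    isnan_ok = True
except TypeError:
    isnan_ok = False

# pd.isnull() accepts object dtype, but the string "missing" is not a null
null_mask = pd.isnull(raw)

# Coercing first converts unparseable entries into NaN
numeric = pd.to_numeric(raw, errors="coerce")
nan_mask = np.isnan(numeric.values)

print(isnan_ok, null_mask.tolist(), nan_mask.tolist())
```

So on text columns it is usually the conversion step, not the choice between np.isnan() and pd.isnull(), that decides which rows count as missing.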

2. A more generic alternative is to define, in a dict, the method and window size to apply to each column:

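import numpy as np
import pandas as pd

param = {}
# priceA: mean of the previous n values
param['priceA'] = {'n': 3,
                   'method': lambda x: x.mean()}

# priceB: the value n rows above (first element of the window)
param['priceB'] = {'n': 2,
                   'method': lambda x: x[0]}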

param now holds, for each column, the number of elements n and the method as a lambda expression. Rewrite your loop accordingly:

for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # Iterate over the NaN indices only
        _window = df.iloc[max(0, i - param[col]['n']):i][col]  # The previous n rows
        # Apply the column's method; df.loc[i, col] avoids chained assignment
        df.loc[i, col] = param[col]['method'](_window.values)

print(df)  # This leads to a similar result.
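
For reference, the generic approach can be wrapped in a function and run on the data from the question. This is only a sketch: the helper name fill_from_previous and the rules argument are hypothetical, and it assumes the default 0..len-1 integer index used in the question.

```python
import numpy as np
import pandas as pd

def fill_from_previous(df, rules):
    """Fill NaNs column by column (hypothetical helper).

    `rules` maps a column name to {'n': window size, 'method': callable}.
    Assumes a default integer index, so labels and positions coincide.
    """
    out = df.copy()
    for col, rule in rules.items():
        for i in out.index[out[col].isna()]:
            # Rows i-n .. i-1, including values filled earlier in this loop
            window = out[col].iloc[max(0, i - rule['n']):i]
            if len(window) > 0:
                out.loc[i, col] = rule['method'](window.values)
    return out

# Sample data from the question
df = pd.DataFrame({
    'priceA': [25.67, 34.12, 37.14, np.nan, 32, 18.75, np.nan, 23, np.nan, 36],
    'priceB': [30.56, 28.43, 29.08, 34.23, np.nan, 41.1, 45.12, 39.67, 36.45, np.nan],
})

rules = {
    'priceA': {'n': 3, 'method': np.mean},         # mean of the previous 3 values
    'priceB': {'n': 2, 'method': lambda x: x[0]},  # value 2 rows above
}

filled = fill_from_previous(df, rules)
print(filled.round(2))
```

Returning a copy rather than modifying df in place is a design choice of this sketch; it keeps the original data available for comparison.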

Answer 2

You can use an NA mask to perform the operation you need on each column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1,2,3,4, None, 5, 6], 'b': [1, None, 2, 3, 4, None, 7]})
df

#     a b
# 0 1.0 1.0
# 1 2.0 NaN
# 2 3.0 2.0
# 3 4.0 3.0
# 4 NaN 4.0
# 5 5.0 NaN
# 6 6.0 7.0

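for col in df.columns:
    s = df[col].copy()  # work on a copy, then assign the column back
    na_indices = s[s.isnull()].index.tolist()
    prev = 0
    for k in na_indices:
        # mean of the values between the previous NaN position and this one
        s.loc[k] = s.iloc[prev:k].mean()
        prev = k

    df[col] = s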

print(df)

#     a   b
# 0 1.0 1.0
# 1 2.0 1.0
# 2 3.0 2.0
# 3 4.0 3.0
# 4 2.5 4.0
# 5 5.0 2.5
# 6 6.0 7.0

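Although this is still a custom operation, I'm fairly sure it will be somewhat faster, because it iterates only over the NA values rather than over every row, and I assume those are sparse compared to the actual data.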

Answer 3

To fill priceA, use rolling, then shift, and pass the result to fillna:

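import numpy as np
import pandas as pd

# make some data (float dtype, so NaN can be assigned without changing dtype)
df = pd.DataFrame({'priceA': np.arange(10, dtype=float)})

# make some rows missing
df.loc[[4, 6], 'priceA'] = np.nan

n = 3

# rolling mean over n rows, shifted down one row so each row only sees earlier values
df['priceA'] = df['priceA'].fillna(df['priceA'].rolling(n, min_periods=1).mean().shift(1))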

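The only edge case here is when two NaNs fall within n rows of each other: the rolling mean then averages only the non-missing values in the window and does not reuse values filled earlier, so the result can differ from the expected output in your question.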
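
The edge case can be checked on the priceA column from the question. The following is a minimal sketch that compares the vectorized rolling fill with a plain sequential loop (the loop is added here only for comparison and is not part of the answer's method); it assumes the default integer index.

```python
import numpy as np
import pandas as pd

# priceA from the question
priceA = pd.Series([25.67, 34.12, 37.14, np.nan, 32, 18.75, np.nan, 23, np.nan, 36])
n = 3

# Vectorized: rolling mean of the original values, shifted down one row
vectorized = priceA.fillna(priceA.rolling(n, min_periods=1).mean().shift(1))

# Sequential: each fill can see values filled earlier in the same pass
sequential = priceA.copy()
for i in sequential.index[sequential.isna()]:
    sequential.iloc[i] = sequential.iloc[max(0, i - n):i].mean()

print(pd.DataFrame({'vectorized': vectorized, 'sequential': sequential}))
```

On this data both versions give 32.31 at row 3, but they differ at rows 6 and 8: the vectorized fill yields 25.375 and 20.875, while the sequential loop reproduces the 27.68 and 23.14 from the expected output, because its windows include the values filled at rows 3 and 6.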

For priceB, simply use shift:

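# float dtype, so NaN can be assigned without changing dtype
df = pd.DataFrame({'priceB': np.arange(10, dtype=float)})
df.loc[[4, 8], 'priceB'] = np.nan

m = 2

# each NaN takes the value m rows above it
df['priceB'] = df['priceB'].fillna(df['priceB'].shift(m))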

As before, there is an edge case: when one NaN sits exactly m rows before another NaN.
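
A small sketch of that edge case, using made-up data with NaNs exactly m rows apart. A single shift-based pass leaves the second NaN unfilled; repeating the pass until nothing changes is one possible workaround (an assumption of this sketch, not something the answer proposes).

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaNs at positions 2 and 4, exactly m rows apart
priceB = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
m = 2

# One pass: position 4 looks at position 2, which is NaN in the original data
one_pass = priceB.fillna(priceB.shift(m))

# Repeat until a pass changes nothing, so later NaNs can pick up earlier fills
filled = priceB.copy()
while True:
    updated = filled.fillna(filled.shift(m))
    if updated.equals(filled):  # equals() treats NaNs in the same place as equal
        break
    filled = updated

print(one_pass.tolist())
print(filled.tolist())
```

The loop stops as soon as a pass makes no change, so NaNs in the first m rows, which have nothing above them to copy, simply stay NaN.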

3 solutions

Solution 1
2 Accepted 2018-01-18 10:52:16

Solution 2
1 2018-01-18 10:11:05

Solution 3
0 2018-01-18 12:12:14
