
Numpy: Fill NaN with values from previous row

I need to replace NaN with values from the previous row, except for the first row, where NaN values are replaced with zero. What would be the most efficient solution?

Sample input and output:

In [179]: arr
Out[179]: 
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In [180]: out
Out[180]: 
array([[ 5.,  0.,  0.,  7.,  2.,  6.,  5.],
       [ 3.,  0.,  1.,  8.,  2.,  5.,  5.],
       [ 4.,  9.,  6.,  8.,  2.,  5.,  7.]])

(EDIT to include a (partially?) vectorized approach)

(EDIT2 to include some timings)

The simplest solution matching your required input/output is to loop through the rows:

import numpy as np


def ffill_loop(arr, fill=0):
    mask = np.isnan(arr[0])
    arr[0][mask] = fill
    for i in range(1, len(arr)):
        mask = np.isnan(arr[i])
        arr[i][mask] = arr[i - 1][mask]
    return arr


print(ffill_loop(arr.copy()))
# [[5. 0. 0. 7. 2. 6. 5.]
#  [3. 0. 1. 8. 2. 5. 5.]
#  [4. 9. 6. 8. 2. 5. 7.]]
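Note that ffill_loop() writes into the array it receives, which is why the call above passes arr.copy(). A minimal sketch of the difference follows; the function is repeated so the snippet runs on its own, and the variable names filled_copy, nans_before, same_object and nans_after are illustrative, not part of the original answer.

```python
import numpy as np


def ffill_loop(arr, fill=0):
    # Same function as above, repeated so this snippet is self-contained.
    mask = np.isnan(arr[0])
    arr[0][mask] = fill
    for i in range(1, len(arr)):
        mask = np.isnan(arr[i])
        arr[i][mask] = arr[i - 1][mask]
    return arr


arr = np.array([[5., np.nan, np.nan, 7., 2., 6., 5.],
                [3., np.nan, 1., 8., np.nan, 5., np.nan],
                [4., 9., 6., np.nan, np.nan, np.nan, 7.]])

# Passing a copy leaves the original array untouched.
filled_copy = ffill_loop(arr.copy())
nans_before = int(np.isnan(arr).sum())

# Passing the array itself fills it in place and returns the same object.
same_object = ffill_loop(arr) is arr
nans_after = int(np.isnan(arr).sum())
```

If the caller does not need the original data afterwards, dropping the copy avoids one full-size allocation.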

You could also use a vectorized approach, which may be faster for larger inputs (the fewer NaNs stacked directly below each other, the better):

import numpy as np


def ffill_roll(arr, fill=0, axis=0):
    mask = np.isnan(arr)
    replaces = np.roll(arr, 1, axis)
    slicing = tuple(0 if i == axis else slice(None) for i in range(arr.ndim))
    replaces[slicing] = fill
    while np.count_nonzero(mask) > 0:
        arr[mask] = replaces[mask]
        mask = np.isnan(arr)
        replaces = np.roll(replaces, 1, axis)
    return arr


print(ffill_roll(arr.copy()))
# [[5. 0. 0. 7. 2. 6. 5.]
#  [3. 0. 1. 8. 2. 5. 5.]
#  [4. 9. 6. 8. 2. 5. 7.]]
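To see what ffill_roll() relies on, it may help to look at np.roll() in isolation: along axis 0 it shifts every row down by one and wraps the last row around to the top, which is why the function overwrites that wrapped row with the fill value. A small sketch on a made-up array:

```python
import numpy as np

a = np.arange(6.0).reshape(3, 2)
# a is [[0, 1], [2, 3], [4, 5]]

shifted = np.roll(a, 1, axis=0)
# Rows move down by one and the last row wraps to the top:
# [[4, 5], [0, 1], [2, 3]]
wrapped_first_row = shifted[0].copy()

# Overwrite the wrapped row, as ffill_roll() does with its fill value.
shifted[0] = 0
```

Each further roll moves the candidate values down one more row, so a vertical run of several NaNs needs several passes of the while loop.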

Timing these functions (including the loop-less solution proposed in @Divakar's answer) gives:

import numpy as np
from numpy import nan


funcs = ffill_loop, ffill_roll, ffill_cols
sep = ' ' * 4
print(f'{"shape":15s}', end=sep)
for func in funcs:
    print(f'{func.__name__:>15s}', end=sep)
print()
for n in (1, 5, 10, 50, 100, 500, 1000, 2000):
    k = l = n
    arr = np.array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.] * k,
        [  3.,  nan,   1.,   8.,  nan,   5.,  nan] * k,
        [  4.,   9.,   6.,  nan,  nan,  nan,   7.] * k] * l)
    print(f'{arr.shape!s:15s}', end=sep)
    for func in funcs:
        result = %timeit -q -o func(arr.copy())
        print(f'{result.best * 1e3:12.3f} ms', end=sep)
    print()

shape                   ffill_loop         ffill_roll         ffill_cols    
(3, 7)                    0.009 ms           0.063 ms           0.026 ms    
(15, 35)                  0.043 ms           0.074 ms           0.034 ms    
(30, 70)                  0.092 ms           0.098 ms           0.055 ms    
(150, 350)                0.783 ms           0.939 ms           0.786 ms    
(300, 700)                2.409 ms           4.060 ms           3.829 ms    
(1500, 3500)             49.447 ms         105.379 ms         169.649 ms    
(3000, 7000)            169.799 ms         340.548 ms         759.854 ms    
(6000, 14000)           656.982 ms        1369.651 ms        1610.094 ms    

This indicates that ffill_loop() is the fastest for most of the given inputs; ffill_cols() is slightly ahead only for the (15, 35) and (30, 70) shapes. As the input size increases, ffill_cols() progressively becomes the slowest approach.
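The %timeit line in the benchmark only works inside IPython or Jupyter. In a plain Python script, a roughly comparable measurement could be sketched with the standard-library timeit module. The helper name best_ms and the repeat/number values are assumptions chosen for illustration; this is not the code that produced the table above, and absolute numbers will differ by machine.

```python
import timeit

import numpy as np


def ffill_loop(arr, fill=0):
    # Same function as above, repeated so this snippet is self-contained.
    mask = np.isnan(arr[0])
    arr[0][mask] = fill
    for i in range(1, len(arr)):
        mask = np.isnan(arr[i])
        arr[i][mask] = arr[i - 1][mask]
    return arr


def best_ms(func, arr, repeat=5, number=10):
    # Hypothetical helper: best time per call in milliseconds.
    # The arr.copy() is part of the timed call, as in the %timeit line.
    times = timeit.repeat(lambda: func(arr.copy()), repeat=repeat, number=number)
    return min(times) / number * 1e3


base = np.array([[5., np.nan, np.nan, 7., 2., 6., 5.],
                 [3., np.nan, 1., 8., np.nan, 5., np.nan],
                 [4., 9., 6., np.nan, np.nan, np.nan, 7.]])
# Tiling the sample 10 times in each direction gives the (30, 70) case.
arr = np.tile(base, (10, 10))

print(f'{arr.shape!s:15s}{best_ms(ffill_loop, arr):12.3f} ms')
```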

Here's a vectorized NumPy-based solution, inspired by an answer to "Most efficient way to forward-fill NaN values in numpy array":

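import numpy as np


def ffill_cols(a, startfillval=0):
    mask = np.isnan(a)
    # Temporarily fill the first row so every column has a valid starting value.
    tmp = a[0].copy()
    a[0][mask[0]] = startfillval
    mask[0] = False
    # For each cell, the row index of the latest non-NaN value at or above it.
    idx = np.where(~mask, np.arange(mask.shape[0])[:, None], 0)
    out = np.take_along_axis(a, np.maximum.accumulate(idx, axis=0), axis=0)
    # Restore the first row so the input array is left unchanged.
    a[0] = tmp
    return out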

Sample run:

In [2]: a
Out[2]: 
array([[ 5., nan, nan,  7.,  2.,  6.,  5.],
       [ 3., nan,  1.,  8., nan,  5., nan],
       [ 4.,  9.,  6., nan, nan, nan,  7.]])

In [3]: ffill_cols(a)
Out[3]: 
array([[5., 0., 0., 7., 2., 6., 5.],
       [3., 0., 1., 8., 2., 5., 5.],
       [4., 9., 6., 8., 2., 5., 7.]])
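The key step in ffill_cols() is turning the NaN mask into, for every cell, the row index of the most recent non-NaN value in the same column. A sketch of the intermediate arrays on a small made-up input (its first row is assumed to contain no NaN, which is what the temporary fill in ffill_cols() guarantees):

```python
import numpy as np

a = np.array([[1., 0.],
              [np.nan, 2.],
              [np.nan, np.nan],
              [4., np.nan]])

mask = np.isnan(a)
rows = np.arange(a.shape[0])[:, None]

# Keep the row index where the value is valid, 0 where it is NaN:
# [[0, 0], [0, 1], [0, 0], [3, 0]]
idx = np.where(~mask, rows, 0)

# A running maximum down each column yields the last valid row so far:
# [[0, 0], [0, 1], [0, 1], [3, 1]]
last_valid = np.maximum.accumulate(idx, axis=0)

# Gather the values from those rows, column by column:
# [[1, 0], [1, 2], [1, 2], [4, 2]]
out = np.take_along_axis(a, last_valid, axis=0)
```

Because everything is done with whole-array operations, no Python-level loop is needed, at the cost of allocating several arrays of the same shape as the input.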

import numpy as np
arr = np.array([[  5.,  np.nan,  np.nan,   7.,   2.,   6.,   5.],
                [  3.,  np.nan,   1.,   8.,  np.nan,   5.,  np.nan],
                [  4.,   9.,   6.,  np.nan,  np.nan,  np.nan,   7.]])

nan_indices = np.isnan(arr)

Here nan_indices is:

array([[False,  True,  True, False, False, False, False],
       [False,  True, False, False,  True, False,  True],
       [False, False, False,  True,  True,  True, False]])

Now it's just a matter of replacing the values using the logic you described in the question:

arr[0, nan_indices[0, :]] = 0

for row in range(1, np.shape(arr)[0]):
    arr[row, nan_indices[row, :]] = arr[row - 1, nan_indices[row, :]] 

Now arr is:

array([[5., 0., 0., 7., 2., 6., 5.],
       [3., 0., 1., 8., 2., 5., 5.],
       [4., 9., 6., 8., 2., 5., 7.]])
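Although nan_indices is computed once up front, vertical runs of consecutive NaNs are still handled, because each row reads from a previous row that has already been filled. A minimal sketch on a made-up single-column input; the wrapper name fill_from_previous_row is hypothetical, and it works on a copy rather than in place:

```python
import numpy as np


def fill_from_previous_row(arr):
    # Hypothetical wrapper around the loop above; operates on a copy.
    arr = arr.copy()
    nan_indices = np.isnan(arr)
    arr[0, nan_indices[0, :]] = 0
    for row in range(1, arr.shape[0]):
        # Row row - 1 is already filled, so a run of NaNs is carried down.
        arr[row, nan_indices[row, :]] = arr[row - 1, nan_indices[row, :]]
    return arr


col = np.array([[np.nan], [np.nan], [3.], [np.nan], [np.nan]])
filled = fill_from_previous_row(col)
print(filled.ravel())  # [0. 0. 3. 3. 3.]
```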

How about this?

import numpy as np

x = np.array([[  5.,  np.nan,  np.nan,   7.,   2.,   6.,   5.],
             [  3.,  np.nan,   1.,   8.,  np.nan,   5.,  np.nan],
             [  4.,   9.,   6.,  np.nan,  np.nan,  np.nan,   7.]])

def fillnans(a):
    a[0, np.isnan(a[0,:])] = 0
    while np.any(np.isnan(a)):
        a[np.isnan(a)] = np.roll(a, 1, 0)[np.isnan(a)]
    return a

print(x)
print(fillnans(x))

Output:

[[ 5. nan nan  7.  2.  6.  5.]
 [ 3. nan  1.  8. nan  5. nan]
 [ 4.  9.  6. nan nan nan  7.]]
[[5. 0. 0. 7. 2. 6. 5.]
 [3. 0. 1. 8. 2. 5. 5.]
 [4. 9. 6. 8. 2. 5. 7.]]
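One property of this approach is worth noting: each pass of the while loop propagates values down by only one row, so the number of passes grows with the longest vertical run of NaNs. The sketch below uses a hypothetical variant, fillnans_counting, that works on a copy and also returns the pass count; the input array is made up for illustration.

```python
import numpy as np


def fillnans_counting(a):
    # Hypothetical variant of fillnans() that also counts the passes.
    a = a.copy()
    a[0, np.isnan(a[0, :])] = 0
    passes = 0
    while np.any(np.isnan(a)):
        nan_mask = np.isnan(a)
        # Take each NaN's replacement from the row directly above it.
        a[nan_mask] = np.roll(a, 1, 0)[nan_mask]
        passes += 1
    return a, passes


x = np.array([[1., 2.],
              [np.nan, 3.],
              [np.nan, np.nan],
              [np.nan, 5.]])
filled, passes = fillnans_counting(x)
# The first column has a run of three NaNs, so three passes are needed.
```

For data with long NaN runs, this means many full-array passes, which is one reason the row loop can end up faster.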

I hope this helps!

import numpy as np

a = np.array([[5., np.nan, np.nan, 7., 2., 6., 5.],
              [3., np.nan, 1., 8., np.nan, 5., np.nan],
              [4., 9., 6., np.nan, np.nan, np.nan, 7.]])

Replace NaN with zeros in the first row:

where_are_NaNs = np.isnan(a[0])
a[0][where_are_NaNs] = 0

Replace NaN in the other rows with the value from the row above:

where_are_NaNs = np.isnan(a)
# Start at row 1: row 0 is already filled, and a[i - 1] would wrap to the last row for i == 0.
for i in range(1, len(where_are_NaNs)):
    for j in range(len(where_are_NaNs[0])):
        if where_are_NaNs[i][j]:
            a[i][j] = a[i - 1][j]
