Resize a 2D numpy array excluding NaN

Question

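I am trying to resize a 2D numpy array by a given factor, obtaining a smaller array as output.

The array is read from an image file, and some of its values are supposed to be NaN (Not a Number, np.nan in numpy): it is the result of remote sensing measurements by satellite, and some pixels simply were not measured.

The suitable package I found for this is scipy.misc.imresize, but every pixel in the output array that contained a NaN is set to NaN, even when the original pixels interpolated together contain some valid data.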

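My solution is attached here. What I do is basically:

create a new array based on the original array shape and the desired reduction factor
create an index array to address all the pixels of the original array that are averaged into each pixel of the new array
loop over the pixels of the new array and average all the non-NaN pixels to obtain the new pixel value; if there are only NaN, the output will be NaN.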

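I plan to add a keyword to choose between different outputs (mean, median, standard deviation of the input pixels, and so on).

It works as expected, but on a ~1Mpx image it takes about 3 seconds. Because of my lack of experience with Python, I am looking for improvements.

Does anyone have a suggestion for how to do this better and more efficiently?

Does anyone know a library that already implements all of this?

Thanks.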

Here is an example output for a random-pixel input, generated with the code below:

[Figure: example output for random pixel input (see code)]

import numpy as np
import pylab as plt
from scipy import misc

def resize_2d_nonan(array,factor):
    """
    Resize a 2D array by a different factor on each axis, skipping NaN values.
    If a new pixel contains only NaN, it will be set to NaN


    Parameters
    ----------

    array : 2D np array

    factor : int or tuple. If int, the x and y factors will be the same

    Returns
    -------
    array : 2D np array scaled by factor

    Created on Mon Jan 27 15:21:25 2014

    @author: damo_ma
    """
    xsize, ysize = array.shape

    if isinstance(factor, int):
        factor_x = factor
        factor_y = factor
    elif isinstance(factor, tuple):
        factor_x, factor_y = factor[0], factor[1]
    else:
        raise TypeError('Factor must be a tuple (x,y) or an integer')

    # both dimensions must be exact multiples of their factor
    if xsize % factor_x != 0 or ysize % factor_y != 0:
        raise ValueError('Array shape must be an integer multiple of the factors')

    # integer division: array shapes must be ints
    new_xsize, new_ysize = xsize // factor_x, ysize // factor_y

    new_array = np.empty([new_xsize, new_ysize])
    new_array[:] = np.nan # this saves us an assignment in the loop below

    # submatrix indexes : is the average box on the original matrix
    subrow, subcol  = np.indices((factor_x, factor_y))

     # new matrix indexs
    row, col  = np.indices((new_xsize, new_ysize))

    # some output for testing
    #for i, j, ind in zip(row.reshape(-1), col.reshape(-1),range(row.size)) :
    #    print '----------------------------------------------'
    #    print 'i: %i, j: %i, ind: %i ' % (i, j, ind)    
    #    print 'subrow+i*new_ysize, subcol+j*new_xsize :'    
    #    print i,'*',new_xsize,'=',i*factor_x
    #    print j,'*',new_ysize,'=',j*factor_y
    #    print subrow+i*factor_x,subcol+j*factor_y
    #    print '---'
    #    print 'array[subrow+i*factor_x,subcol+j*factor_y] : '    
    #    print array[subrow+i*factor_x,subcol+j*factor_y]

    for i, j, ind in zip(row.reshape(-1), col.reshape(-1),range(row.size)) :
        # extract the averaging box (fancy indexing returns a copy, not a view)
        sub_matrix = array[subrow+i*factor_x,subcol+j*factor_y]
        # modified from any(a) and all(a) to a.any() and a.all()
        # see https://stackoverflow.com/a/10063039/1435167
        if not (np.isnan(sub_matrix)).all(): # at least one valid pixel
            if (np.isnan(sub_matrix)).any(): # some NaN: average the valid pixels only
                msub_matrix = np.ma.masked_array(sub_matrix,np.isnan(sub_matrix))
                (new_array.reshape(-1))[ind] = np.mean(msub_matrix)
            else: # no NaN at all: plain mean
                (new_array.reshape(-1))[ind] = np.mean(sub_matrix)
        # the all-NaN case needs no assignment because new_array
        # was initialised to NaN

    return new_array


row , cols = 6, 4

a = 10*np.random.random_sample((row , cols))
a[0:3,0:2] = np.nan
a[0,2] = np.nan

factor_x = 2
factor_y = 2
# note: scipy.misc.imresize was removed in SciPy 1.3, so this line needs an older SciPy
a_misc = misc.imresize(a, .5, interp='nearest', mode='F')
a_2d_nonan = resize_2d_nonan(a,(factor_x,factor_y))

print(a)
print()
print(a_misc)
print()
print(a_2d_nonan)

plt.subplot(131)
plt.imshow(a,interpolation='nearest')
plt.title('original')
plt.xticks(np.arange(a.shape[1]))
plt.yticks(np.arange(a.shape[0]))
plt.subplot(132)
plt.imshow(a_misc,interpolation='nearest')
plt.title('scipy.misc')
plt.xticks(np.arange(a_misc.shape[1]))
plt.yticks(np.arange(a_misc.shape[0]))
plt.subplot(133)
plt.imshow(a_2d_nonan,interpolation='nearest')
plt.title('my.func')
plt.xticks(np.arange(a_2d_nonan.shape[1]))
plt.yticks(np.arange(a_2d_nonan.shape[0]))
plt.show()

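Edit

I added some modifications to address ChrisProsser's comment.

If I substitute the NaN with some other value, say the mean of the non-NaN pixels, it affects all subsequent calculations: the difference between the resampled original array and the resampled array with the NaN substituted shows that 2 pixels changed their values.

My goal is simply to skip all NaN pixels.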

# substitute NaN with the average value 

ind_nonan , ind_nan = np.where(np.isnan(a) == False), np.where(np.isnan(a) == True)
a_substitute = np.copy(a)

a_substitute[ind_nan] = np.mean(a_substitute[ind_nonan]) # substitute the NaN with average on the not-Nan

a_substitute_misc = misc.imresize(a_substitute, .5, interp='nearest', mode='F')
a_substitute_2d_nonan = resize_2d_nonan(a_substitute,(factor_x,factor_y))

print(a_2d_nonan-a_substitute_2d_nonan)

[[        nan -0.02296697]
 [ 0.23143208  0.        ]
 [ 0.          0.        ]]
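To make the effect concrete, here is a minimal sketch with made-up numbers (the block values and `fill_value` are assumptions for illustration, not taken from the arrays above). It compares the mean of a single averaging block when the NaN is skipped with the mean when the NaN is first replaced by an image-wide average:

```python
import numpy as np

# One 2x2 averaging block with a single missing pixel (made-up values).
block = np.array([[np.nan, 1.0],
                  [2.0, 3.0]])
fill_value = 10.0  # hypothetical image-wide mean used as the substitute

# Skipping the NaN: mean of the three valid pixels 1, 2, 3.
skipped = np.nanmean(block)

# Substituting first: mean of 10, 1, 2, 3, pulled towards fill_value.
filled = np.where(np.isnan(block), fill_value, block).mean()

print(skipped, filled)
```

Any block that mixes valid and missing pixels is biased towards the substituted value, which is why only the mixed blocks differ in the output above.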


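Second edit

To address Hooked's answer, I added some extra code. It is an interesting idea, but unfortunately it interpolates new values over pixels that should be "empty" (NaN), and for my small example it generates more NaN than good values.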

X , Y  = np.indices((row , cols))
X_new , Y_new  = np.indices((row//factor_x , cols//factor_y))

from scipy.interpolate import CloughTocher2DInterpolator as intp
C = intp((X[ind_nonan],Y[ind_nonan]),a[ind_nonan])

a_interp = C(X_new , Y_new)

print(a)
print()
print(a_interp)

[[        nan,         nan],
 [        nan,         nan],
 [        nan,  6.32826577]]


Answer 1

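Use scipy.interpolate to interpolate the points onto a different grid. Below I show a cubic interpolator, which is slower but probably more accurate. You will notice that the corner pixels are missing with this function; you could then use a linear or nearest-neighbor interpolation to handle those last values.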


import numpy as np
import pylab as plt

# Test data
row = np.linspace(-3,3,50)
X,Y = np.meshgrid(row,row)
Z = np.sqrt(X**2+Y**2) + np.cos(Y) 

# Make some dead pixels, favor an edge
dead = np.random.random(Z.shape)
dead = (dead*X>.7)
Z[dead] =np.nan

from scipy.interpolate import CloughTocher2DInterpolator as intp
C = intp((X[~dead],Y[~dead]),Z[~dead])

new_row = np.linspace(-3,3,25)
xi,yi   = np.meshgrid(new_row,new_row)
zi = C(xi,yi)

plt.subplot(121)
plt.title("Original signal 50x50")
plt.imshow(Z,interpolation='nearest')

plt.subplot(122)
plt.title("Interpolated signal 25x25")
plt.imshow(zi,interpolation='nearest')

plt.show()
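The nearest-neighbor clean-up mentioned above could look roughly like the following. This is only a sketch: `fill_nan_nearest` is a hypothetical helper name, and it assumes that filling every remaining NaN with the closest valid sample is acceptable, which it may not be if some pixels are meant to stay empty.

```python
import numpy as np
from scipy.interpolate import NearestNDInterpolator

def fill_nan_nearest(xi, yi, zi, x_valid, y_valid, z_valid):
    """Replace NaN entries of zi with the value of the nearest valid sample.

    xi, yi, zi are the target grid and the interpolated values on it;
    x_valid, y_valid, z_valid are the 1D coordinates and values of the
    valid input samples. (Hypothetical helper, for illustration only.)
    """
    filled = zi.copy()
    holes = np.isnan(zi)
    if holes.any():
        nearest = NearestNDInterpolator(
            np.column_stack((x_valid, y_valid)), z_valid)
        filled[holes] = nearest(xi[holes], yi[holes])
    return filled

# Tiny demonstration with made-up data: a 2x2 target grid with one hole.
x_valid = np.array([0.0, 1.0, 0.0, 1.0])
y_valid = np.array([0.0, 0.0, 1.0, 1.0])
z_valid = np.array([10.0, 20.0, 30.0, 40.0])
xi, yi = np.meshgrid([0.0, 1.0], [0.0, 1.0])
zi = np.array([[10.0, np.nan],
               [30.0, 40.0]])
filled = fill_nan_nearest(xi, yi, zi, x_valid, y_valid, z_valid)
print(filled)
```

With the variables from the example above, the call would be `fill_nan_nearest(xi, yi, zi, X[~dead], Y[~dead], Z[~dead])`.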

Answer 2

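You are operating on small windows of the array. Instead of looping through the array to build the windows, the array can be restructured efficiently by manipulating its strides. The numpy library provides the as_strided() function to help with this. An example is given in the SciPy CookBook, Stride tricks for the Game of Life.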
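As a rough illustration of the stride idea (a sketch only; the array values and the 2x2 block size are arbitrary assumptions), `as_strided` can expose non-overlapping blocks of a 2D array as a 4D view without copying any data:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Small 4x6 test array; the values are arbitrary.
a = np.arange(24, dtype=float).reshape(4, 6)

block = (2, 2)  # window size along (rows, cols); assumed to divide a.shape
n_rows, n_cols = a.shape[0] // block[0], a.shape[1] // block[1]
row_stride, col_stride = a.strides

# Shape: (blocks down, blocks across, rows in block, cols in block).
# Strides: moving to the next block skips block-size elements, while
# moving inside a block uses the array's own strides.
blocks = as_strided(
    a,
    shape=(n_rows, n_cols, block[0], block[1]),
    strides=(row_stride * block[0], col_stride * block[1],
             row_stride, col_stride),
    writeable=False,  # the result is a view; guard against accidental writes
)

print(blocks[0, 1])  # the 2x2 block covering rows 0-1, columns 2-3
```

Because `as_strided` performs no bounds checking, the shape and strides have to be computed carefully; a wrong value silently reads unrelated memory.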

The code below uses a generalized sliding window function, which I include at the end.

Determine the shape of the new array:

rows, cols = a.shape
new_shape = rows // 2, cols // 2

Restructure the array into the windows you need, and create an indexing array that identifies the NaNs:

# 2x2 windows of the original array
windows = sliding_window(a, (2,2))
# make a windowed boolean array for indexing
notNan = sliding_window(np.logical_not(np.isnan(a)), (2,2))

The new array can be built with a list comprehension or a generator expression.

# using a list comprehension
# make a list of the means of the windows, disregarding the Nan's
means = [window[index].mean() for window, index in zip(windows, notNan)]
new_array = np.array(means).reshape(new_shape)

# generator expression
# produces the means of the windows, disregarding the Nan's
means = (window[index].mean() for window, index in zip(windows, notNan))
new_array = np.fromiter(means, dtype = np.float32).reshape(new_shape)

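The generator expression should conserve memory. If memory is a problem, using itertools.izip() instead of zip should also help (this applies to Python 2; in Python 3 the built-in zip is already lazy and izip no longer exists). I just used the list comprehension for your solution.

Your function: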

def resize_2d_nonan(array,factor):
    """
    Resize a 2D array by different factor on two axis skipping NaN values.
    If a new pixel contains only NaN, it will be set to NaN

    Parameters
    ----------
    array : 2D np array

    factor : int or tuple. If int, the x and y factors will be the same

    Returns
    -------
    array : 2D np array scaled by factor

    Created on Mon Jan 27 15:21:25 2014

    @author: damo_ma
    """
    xsize, ysize = array.shape

    if isinstance(factor,int):
        factor_x = factor
        factor_y = factor
        window_size = factor, factor
    elif isinstance(factor,tuple):
        factor_x , factor_y = factor
        window_size = factor
    else:
        raise TypeError('Factor must be a tuple (x,y) or an integer')

    if (xsize % factor_x or ysize % factor_y) :
        raise ValueError('Array shape must be an integer multiple of the factors')

    # integer division: reshape() needs int dimensions
    new_shape = xsize // factor_x, ysize // factor_y

    # non-overlapping windows of the original array
    # (use the function argument, not the global a)
    windows = sliding_window(array, window_size)
    # windowed boolean array for indexing
    notNan = sliding_window(np.logical_not(np.isnan(array)), window_size)

    #list of the means of the windows, disregarding the Nan's
    means = [window[index].mean() for window, index in zip(windows, notNan)]
    # new array
    new_array = np.array(means).reshape(new_shape)

    return new_array

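I haven't done any timing comparisons with your original function, but it should be faster.

Many solutions I have seen here on SO vectorize the operations to increase speed/efficiency - I don't have a good handle on that and don't know whether it can be applied to your problem. Searching SO for window, array, moving average, vectorize, and numpy should turn up similar questions and answers for reference.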
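One way the windowed mean might be vectorized is to reshape the array so that each block gets its own pair of axes and then reduce over those axes with a NaN-aware function. The following is a sketch under that assumption, not a timed or drop-in replacement: `block_reduce_nan` and its `reducer` parameter are names made up here, and the array shape is assumed to be an exact multiple of the factors.

```python
import warnings
import numpy as np

def block_reduce_nan(array, factor, reducer=np.nanmean):
    """Shrink a 2D array by reducing non-overlapping blocks, ignoring NaN.

    Hypothetical helper for illustration. A block that contains only NaN
    yields NaN in the output.
    """
    fx, fy = (factor, factor) if isinstance(factor, int) else factor
    xsize, ysize = array.shape
    if xsize % fx or ysize % fy:
        raise ValueError('Array shape must be an integer multiple of the factors')
    # (xsize, ysize) -> (xsize//fx, fx, ysize//fy, fy): axes 1 and 3 index
    # the position inside each block, axes 0 and 2 index the block itself.
    blocks = array.reshape(xsize // fx, fx, ysize // fy, fy)
    with warnings.catch_warnings():
        # All-NaN blocks make the nan* reducers emit a RuntimeWarning;
        # the NaN they return is the result wanted here.
        warnings.simplefilter('ignore', category=RuntimeWarning)
        return reducer(blocks, axis=(1, 3))

# Made-up 4x4 input with missing pixels, reduced by a factor of 2.
a = np.array([[np.nan, 1.0, 2.0, 4.0],
              [np.nan, 3.0, 6.0, 8.0],
              [np.nan, np.nan, 1.0, 1.0],
              [np.nan, np.nan, 1.0, np.nan]])
small = block_reduce_nan(a, 2)
print(small)
```

Passing `np.nanmedian` or `np.nanstd` as `reducer` would cover the other statistics mentioned in the question, since both accept a tuple of axes in the same way.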

sliding_window(), see the attribution below:

import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product

def norm_shape(shape):
    '''
    Normalize numpy array shapes so they're always expressed as a tuple, 
    even for one-dimensional shapes.
     
    Parameters
        shape - an int, or a tuple of ints
     
    Returns
        a shape tuple
    '''
    try:
        i = int(shape)
        return (i,)
    except TypeError:
        # shape was not a number
        pass
 
    try:
        t = tuple(shape)
        return t
    except TypeError:
        # shape was not iterable
        pass
     
    raise TypeError('shape must be an int, or a tuple of ints')
 

def sliding_window(a,ws,ss = None,flatten = True):
    '''
    Return a sliding window over a in any number of dimensions
     
    Parameters:
        a  - an n-dimensional numpy array
        ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size 
             of each dimension of the window
        ss - an int (a is 1D) or tuple (a is 2D or greater) representing the 
             amount to slide the window in each dimension. If not specified, it
             defaults to ws.
        flatten - if True, all slices are flattened, otherwise, there is an 
                  extra dimension for each dimension of the input.
     
    Returns
        an array containing each n-dimensional window from a
    '''
     
    if None is ss:
        # ss was not provided. the windows will not overlap in any direction.
        ss = ws
    ws = norm_shape(ws)
    ss = norm_shape(ss)
     
    # convert ws, ss, and a.shape to numpy arrays so that we can do math in every 
    # dimension at once.
    ws = np.array(ws)
    ss = np.array(ss)
    shape = np.array(a.shape)
     
     
    # ensure that ws, ss, and a.shape all have the same number of dimensions
    ls = [len(shape),len(ws),len(ss)]
    if 1 != len(set(ls)):
        raise ValueError(\
        'a.shape, ws and ss must all have the same length. They were %s' % str(ls))
     
    # ensure that ws is smaller than a in every dimension
    if np.any(ws > shape):
        raise ValueError(\
        'ws cannot be larger than a in any dimension.\
 a.shape was %s and ws was %s' % (str(a.shape),str(ws)))
     
    # how many slices will there be in each dimension?
    newshape = norm_shape(((shape - ws) // ss) + 1)
    # the shape of the strided array will be the number of slices in each dimension
    # plus the shape of the window (tuple addition)
    newshape += norm_shape(ws)
    # the strides tuple will be the array's strides multiplied by step size, plus
    # the array's strides (tuple addition)
    newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
    strided = ast(a,shape = newshape,strides = newstrides)
    if not flatten:
        return strided
     
    # Collapse strided so that it has one more dimension than the window.  I.e.,
    # the new array is a flat list of slices.
    meat = len(ws) if ws.shape else 0
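    # np.prod replaces np.product, which was removed in NumPy 2.0
    firstdim = (np.prod(newshape[:-meat]),) if ws.shape else ()
    dim = firstdim + (newshape[-meat:])
    # remove any dimensions with size 1
    # (build a tuple: in Python 3 filter() returns an iterator, which reshape rejects)
    dim = tuple(i for i in dim if i != 1)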
    return strided.reshape(dim)

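sliding_window() attribution
I originally found this on a blog page that is now a broken link:

Efficient Overlapping Windows with Numpy - http://www.johnvinyard.com/blog/?p=268

With a little searching, it looks like it now resides in the Zounds github repository. Thanks, John Vinyard.

Note that this post is fairly old, and there are many SO Q&As about sliding windows, rolling windows, and image patch extraction. There are many one-offs using numpy's as_strided, but this function still seems to be the only one that handles n-dimensional windows. The scikits sklearn.feature_extraction.image library seems to be cited often for extracting or viewing image patches.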

2 solutions

Solution 1
2 2014-02-03 15:38:32

Solution 2
2 accepted 2014-02-08 21:02:36
