How to efficiently interpolate data in a Pandas DataFrame row-wise?

I have several thousand "observations". Each observation consists of positions (x, y) and a sensor reading (z); see the example below.

I would like to fit a bilinear surface to the x, y and z data. I am currently doing this with the code snippet from amroamroamro/gist:

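import numpy as np
import scipy.linalg

def bi2Dlinter(xdata, ydata, zdata, gridrez):
    # Regular grid spanning the range of the input points
    X, Y = np.meshgrid(
        np.linspace(min(xdata), max(xdata), endpoint=True, num=gridrez),
        np.linspace(min(ydata), max(ydata), endpoint=True, num=gridrez))
    # Least-squares fit of the plane z = C[0]*x + C[1]*y + C[2]
    A = np.c_[xdata, ydata, np.ones(len(zdata))]
    C, _, _, _ = scipy.linalg.lstsq(A, zdata)
    Z = C[0]*X + C[1]*Y + C[2]
    return Z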

My current approach is to loop over the rows of the DataFrame. (This works fine for 1000 observations, but is not usable for larger data sets.)

ZZ = []
for index, row in df2.iterrows():
    x=row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y=row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z=row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append(np.median(bi2Dlinter(x,y,z,gridrez)))
df2['ZZ']=ZZ

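I would be surprised if there were no more efficient way to do this. Is there a way to vectorize the linear interpolation?

I have put the code here, which also generates dummy entries. Thanks.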

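Looping over a DataFrame like this is generally not recommended. Instead, you should vectorize your code as much as possible.

First, we create arrays from your inputs: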

x_vals = df2[['x1','x2','x3','x4','x5']].values
y_vals = df2[['y1','y2','y3','y4','y5']].values
z_vals = df2[['z1','z2','z3','z4','z5']].values
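
For readers without the original data, here is a minimal sketch of this step on a dummy `df2`. The column names follow the question; the random values, the seed and the row count `n_rows` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy data: n_rows observations, five (x, y, z) points each.
rng = np.random.default_rng(0)
n_rows = 4
cols = [f"{axis}{i}" for axis in "xyz" for i in range(1, 6)]
df2 = pd.DataFrame(rng.random((n_rows, len(cols))), columns=cols)

# One 2-D array per coordinate, each of shape (n_rows, 5).
x_vals = df2[[f"x{i}" for i in range(1, 6)]].to_numpy()
y_vals = df2[[f"y{i}" for i in range(1, 6)]].to_numpy()
z_vals = df2[[f"z{i}" for i in range(1, 6)]].to_numpy()

print(x_vals.shape, y_vals.shape, z_vals.shape)
```

Each row of these arrays holds the five values that the loop version reads from a single DataFrame row.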

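Next, we need a version of bi2Dlinter that handles vectorized inputs. This involves changing linspace/meshgrid to work with arrays and replacing the least-squares function. Normally scipy.linalg functions work on stacked arrays, but as far as I know .lstsq does not. Instead, we can use the SVD (np.linalg.svd) to replicate the same functionality on arrays.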

def create_ranges(start, stop, N, endpoint=True):
    if endpoint==1:
        divisor = N-1
    else:
        divisor = N
    steps = (1.0/divisor) * (stop - start)
    return steps[:,None]*np.arange(N) + start[:,None]

def linspace_nd(x,y,gridrez):
    a1 = create_ranges(x.min(axis=1), x.max(axis=1), N=gridrez, endpoint=True)
    a2 = create_ranges(y.min(axis=1), y.max(axis=1), N=gridrez, endpoint=True)
    out_shp = a1.shape + (a2.shape[1],)
    Xout = np.broadcast_to(a1[:,None,:], out_shp)
    Yout = np.broadcast_to(a2[:,:,None], out_shp)
    return Xout, Yout

def stacked_lstsq(L, b, rcond=1e-10):
    """
    Solve L x = b, via SVD least squares cutting of small singular values
    L is an array of shape (..., M, N) and b of shape (..., M).
    Returns x of shape (..., N)
    """
    u, s, v = np.linalg.svd(L, full_matrices=False)
    s_max = s.max(axis=-1, keepdims=True)
    s_min = rcond*s_max
    inv_s = np.zeros_like(s)
    inv_s[s >= s_min] = 1/s[s>=s_min]
    x = np.einsum('...ji,...j->...i', v,
                  inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
    return np.conj(x, x)

def vectorized_bi2Dlinter(x_vals, y_vals, z_vals, gridrez):

    X,Y = linspace_nd(x_vals, y_vals, gridrez)
    A = np.stack((x_vals,y_vals,np.ones_like(z_vals)), axis=2)
    C = stacked_lstsq(A, z_vals)
    n_bcast = C.shape[0]
    return C.T[0].reshape((n_bcast,1,1))*X + C.T[1].reshape((n_bcast,1,1))*Y + C.T[2].reshape((n_bcast,1,1))
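
As a rough cross-check of the stacked least-squares idea, the sketch below solves many small systems at once with `np.linalg.pinv`, which accepts stacks of matrices, and compares the result with one `scipy.linalg.lstsq` call per row. This swaps in the pseudo-inverse for the hand-written SVD routine above, and the random inputs and sizes are assumptions for illustration only.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(1)
n_rows, n_pts = 6, 5
x = rng.random((n_rows, n_pts))
y = rng.random((n_rows, n_pts))
z = rng.random((n_rows, n_pts))

# Stacked design matrices, shape (n_rows, n_pts, 3): columns are x, y, 1.
A = np.stack((x, y, np.ones_like(z)), axis=2)

# Batched least squares: pinv(A) has shape (n_rows, 3, n_pts),
# and the einsum applies each pseudo-inverse to its own row of z.
C_batched = np.einsum('nij,nj->ni', np.linalg.pinv(A), z)

# Reference: one scipy.linalg.lstsq call per row.
C_loop = np.array([scipy.linalg.lstsq(A[i], z[i])[0] for i in range(n_rows)])

print(np.allclose(C_batched, C_loop))
```

Both approaches return one coefficient triple per row, so `C_batched` has shape `(n_rows, 3)`.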

Testing on data with n = 10000 rows shows that the vectorized function is significantly faster.

%%timeit
ZZ = []
for index, row in df2.iterrows():
    x=row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y=row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z=row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append((bi2Dlinter(x,y,z,gridrez)))
df2['ZZ']=ZZ

Out: 5.52 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
res = vectorized_bi2Dlinter(x_vals,y_vals,z_vals,gridrez)

Out: 74.6 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
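
The loop in the question stores `np.median` of each grid, whereas the vectorized function returns the full grids, with shape `(n, gridrez, gridrez)`. A sketch of reducing such a stack to one value per row follows; the array `res` here is random stand-in data, not output from the functions above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows, gridrez = 3, 4
res = rng.random((n_rows, gridrez, gridrez))  # stand-in for the stacked grids

# Median over the two grid axes gives one value per row.
zz = np.median(res, axis=(1, 2))

# Same result as taking the median grid by grid.
zz_loop = np.array([np.median(grid) for grid in res])

print(zz.shape)
print(np.allclose(zz, zz_loop))
```

With the real result, the equivalent of the original loop's last line would be `df2['ZZ'] = np.median(res, axis=(1, 2))`.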

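You should pay close attention to what is going on in this vectorized function and become familiar with broadcasting in NumPy. I can't take credit for the first three functions; instead, I will link the Stack Overflow answers they come from so you can gain some understanding.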

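Vectorized NumPy linspace for multiple start and stop values

How to solve many overdetermined systems of linear equations using vectorized code?

How to properly use numpy.c_ for arrays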
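
As a small illustration of the broadcasting used in `create_ranges`, the sketch below builds one evenly spaced row per (start, stop) pair and compares it with `np.linspace`, which accepts array-valued `start` and `stop` in NumPy 1.16 and later. The sample values are made up.

```python
import numpy as np

start = np.array([0.0, 10.0, -1.0])
stop = np.array([1.0, 20.0, 1.0])
N = 5

# steps[:, None] has shape (3, 1); np.arange(N) has shape (5,).
# Broadcasting stretches both to (3, 5): one row per (start, stop) pair.
steps = (stop - start) / (N - 1)
rows = steps[:, None] * np.arange(N) + start[:, None]

# np.linspace with array inputs; axis=-1 puts the samples on the last axis.
rows_np = np.linspace(start, stop, num=N, axis=-1)

print(rows.shape)
print(np.allclose(rows, rows_np))
```

The `[:, None]` indexing adds a trailing axis of length 1, which is what lets a vector of per-row step sizes multiply a shared vector of sample indices.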
