How to efficiently interpolate data in a Pandas DataFrame row-wise?

Question

I have several thousand "observations". Each observation consists of location (x,y) and sensor reading (z), see example below.

I would like to fit a bilinear surface to the x, y, and z data. I am currently doing it with the code snippet from amroamroamro/gist:

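import numpy as np
import scipy.linalg

def bi2Dlinter(xdata, ydata, zdata, gridrez):
    # Regular grid spanning the observed x and y ranges
    X, Y = np.meshgrid(
        np.linspace(min(xdata), max(xdata), endpoint=True, num=gridrez),
        np.linspace(min(ydata), max(ydata), endpoint=True, num=gridrez))
    # Least-squares fit of the plane z = C[0]*x + C[1]*y + C[2]
    A = np.c_[xdata, ydata, np.ones(len(zdata))]
    C, _, _, _ = scipy.linalg.lstsq(A, zdata)
    # Evaluate the fitted plane on the grid
    Z = C[0]*X + C[1]*Y + C[2]
    return Z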

My current approach is to cycle through the rows of the DataFrame. (This works great for 1000 observations but is not usable for larger data-sets.)

ZZ = []
for index, row in df2.iterrows():
    x=row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y=row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z=row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append(np.median(bi2Dlinter(x,y,z,gridrez)))
df2['ZZ']=ZZ

I would be surprised if there is not a more efficient way to do this. Is there a way to vectorize the linear interpolation?

I put the code here which also generates dummy entries. Thanks
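The linked code is not reproduced here. As a stand-in, this is a minimal sketch of how a comparable dummy DataFrame could be built. The column names x1..x5, y1..y5, z1..z5 follow the loop above; the helper name, row count, value range, seed, and gridrez value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def make_dummy_observations(n_rows=1000, n_points=5, seed=0):
    """Hypothetical helper: DataFrame with columns x1..x5, y1..y5, z1..z5."""
    rng = np.random.default_rng(seed)
    data = {}
    for prefix in ("x", "y", "z"):
        # One block of random values per coordinate, one column per point
        values = rng.uniform(0.0, 10.0, size=(n_rows, n_points))
        for k in range(n_points):
            data[f"{prefix}{k + 1}"] = values[:, k]
    return pd.DataFrame(data)

df2 = make_dummy_observations()
gridrez = 10  # assumed grid resolution, not taken from the original code
```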

Answer 1

Looping over a DataFrame row by row like this is generally not recommended. Instead, try to vectorize your code as much as possible.

First, we create one array for each group of input columns:

x_vals = df2[['x1','x2','x3','x4','x5']].values
y_vals = df2[['y1','y2','y3','y4','y5']].values
z_vals = df2[['z1','z2','z3','z4','z5']].values

Next, we need a version of bi2Dlinter that handles array inputs. This involves changing the linspace/meshgrid step to work on a whole array at once and replacing the least-squares solver. Many numpy.linalg functions (such as the svd used below) operate on stacks of matrices but, as far as I'm aware, lstsq does not. Instead, we can use the SVD to replicate the same functionality over an array.
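As a quick sanity check of that idea before the full functions, here is a small sketch comparing a batched SVD solve against lstsq called once per system. The random test data and variable names are assumptions for illustration, and the sketch skips the cutoff for small singular values that the function below applies.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5, 3))   # 4 systems, 5 equations, 3 unknowns each
b = rng.normal(size=(4, 5))

# Reference: solve each system separately with lstsq.
looped = np.array([np.linalg.lstsq(A[i], b[i], rcond=None)[0]
                   for i in range(len(A))])

# Batched: SVD of the whole stack, then x = V @ diag(1/s) @ U^T @ b.
u, s, vt = np.linalg.svd(A, full_matrices=False)
utb = np.einsum('nij,ni->nj', u, b)              # U^T b, shape (4, 3)
batched = np.einsum('nji,nj->ni', vt, utb / s)   # V (U^T b / s), shape (4, 3)

# np.linalg.pinv also accepts stacked matrices and yields the same solution.
via_pinv = (np.linalg.pinv(A) @ b[..., None])[..., 0]
```

All three arrays should agree to floating-point tolerance, which can be confirmed with np.allclose.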

def create_ranges(start, stop, N, endpoint=True):
    if endpoint:
        divisor = N-1
    else:
        divisor = N
    steps = (1.0/divisor) * (stop - start)
    return steps[:,None]*np.arange(N) + start[:,None]

def linspace_nd(x,y,gridrez):
    a1 = create_ranges(x.min(axis=1), x.max(axis=1), N=gridrez, endpoint=True)
    a2 = create_ranges(y.min(axis=1), y.max(axis=1), N=gridrez, endpoint=True)
    out_shp = a1.shape + (a2.shape[1],)
    Xout = np.broadcast_to(a1[:,None,:], out_shp)
    Yout = np.broadcast_to(a2[:,:,None], out_shp)
    return Xout, Yout

def stacked_lstsq(L, b, rcond=1e-10):
    """
    Solve L x = b, via SVD least squares cutting off small singular values
    L is an array of shape (..., M, N) and b of shape (..., M).
    Returns x of shape (..., N)
    """
    u, s, v = np.linalg.svd(L, full_matrices=False)  # v is V^H, as returned by NumPy
    s_max = s.max(axis=-1, keepdims=True)
    s_min = rcond*s_max
    inv_s = np.zeros_like(s)
    inv_s[s >= s_min] = 1/s[s>=s_min]
    x = np.einsum('...ji,...j->...i', v,
                  inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
    return np.conj(x, x)

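def vectorized_bi2Dlinter(x_vals, y_vals, z_vals, gridrez):
    # Per-row grids, each of shape (n, gridrez, gridrez)
    X, Y = linspace_nd(x_vals, y_vals, gridrez)
    # Per-row design matrices [x, y, 1], shape (n, points, 3)
    A = np.stack((x_vals, y_vals, np.ones_like(z_vals)), axis=2)
    # Plane coefficients for every row, shape (n, 3)
    C = stacked_lstsq(A, z_vals)
    n_bcast = C.shape[0]
    # Reshape each coefficient to (n, 1, 1) so it broadcasts against the grids
    return C.T[0].reshape((n_bcast,1,1))*X + C.T[1].reshape((n_bcast,1,1))*Y + C.T[2].reshape((n_bcast,1,1))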
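The question ultimately wants one median per row rather than the full grids. Below is a compact sketch of that last step. Note that it is a variant, not the functions above: it swaps in np.linalg.pinv for stacked_lstsq and relies on np.linspace accepting array start/stop values (NumPy 1.16 or newer) instead of create_ranges. The function name plane_grid_median and the demo data are assumptions for illustration.

```python
import numpy as np

def plane_grid_median(x_vals, y_vals, z_vals, gridrez):
    """Fit z = c0*x + c1*y + c2 per row; return the median of each fitted grid."""
    # Per-row design matrices, shape (n, points, 3)
    A = np.stack((x_vals, y_vals, np.ones_like(z_vals)), axis=2)
    # pinv works on the whole stack at once; C has shape (n, 3)
    C = (np.linalg.pinv(A) @ z_vals[..., None])[..., 0]
    # Array-valued start/stop give one linspace per row, shape (n, gridrez)
    gx = np.linspace(x_vals.min(axis=1), x_vals.max(axis=1), gridrez, axis=-1)
    gy = np.linspace(y_vals.min(axis=1), y_vals.max(axis=1), gridrez, axis=-1)
    # Broadcast to (n, gridrez, gridrez): x varies along columns, y along rows
    Z = (C[:, 0, None, None] * gx[:, None, :]
         + C[:, 1, None, None] * gy[:, :, None]
         + C[:, 2, None, None])
    return np.median(Z, axis=(1, 2))

# Demo on synthetic rows that lie exactly on the plane z = 2x + 3y + 1
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(6, 5))
y_demo = rng.uniform(0, 10, size=(6, 5))
z_demo = 2.0 * x_demo + 3.0 * y_demo + 1.0
zz_demo = plane_grid_median(x_demo, y_demo, z_demo, gridrez=10)

# With the arrays from this answer, the equivalent call would be:
# df2['ZZ'] = plane_grid_median(x_vals, y_vals, z_vals, gridrez)
```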

Testing this on data with n=10000 rows, the vectorized function was roughly 74 times faster (74.6 ms versus 5.52 s):

%%timeit
ZZ = []
for index, row in df2.iterrows():
    x=row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y=row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z=row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append(bi2Dlinter(x,y,z,gridrez))
df2['ZZ']=ZZ

Out: 5.52 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
res = vectorized_bi2Dlinter(x_vals,y_vals,z_vals,gridrez)

Out: 74.6 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pay careful attention to what's going on in this vectorized function, and familiarize yourself with broadcasting in NumPy. I cannot take credit for the first three functions; instead, I link the Stack Overflow answers they come from so you can get an understanding:
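To make the broadcasting step concrete, here is a small sketch of why the coefficients are reshaped to (n, 1, 1) before being multiplied with the grids. The sizes and values are arbitrary assumptions chosen to keep the shapes easy to follow.

```python
import numpy as np

n, gridrez = 3, 4
coeff = np.array([1.0, 2.0, 3.0])               # one coefficient per row, shape (3,)
gx = np.linspace(0.0, 1.0, gridrez)             # shape (4,)
X = np.broadcast_to(gx, (n, gridrez, gridrez))  # same grid for every row, shape (3, 4, 4)

# Broadcasting aligns trailing axes, so shape (3,) does not line up with
# (3, 4, 4). Reshaping to (3, 1, 1) puts the per-row axis first and lets
# the two size-1 axes stretch over each grid.
scaled = coeff.reshape(n, 1, 1) * X             # (3, 1, 1) * (3, 4, 4) -> (3, 4, 4)
```

Each slice scaled[i] is then the grid multiplied by that row's own coefficient, which is what the return line of vectorized_bi2Dlinter does for all three coefficients.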

Vectorized NumPy linspace for multiple start and stop values

how to solve many overdetermined systems of linear equations using vectorized codes?

How to use numpy.c_ properly for arrays

solution1 1 ACCEPTED 2019-03-28 09:40:05
