How to improve performance when interpolating/smoothing a 2D line with scipy?

Question

I have a moderately sized data set: 20000 x 2 floats in a two-column matrix. The first column is x, the distance from the origin point along a trajectory; the second column is y, the work done on the object. The data set comes from lab operations, so it's fairly arbitrary. I've already turned this structure into a numpy array. I want to plot y vs x as a smooth curve, so I hoped the following code would help:

x_smooth = np.linspace(x.min(),x.max(), 20000)
y_smooth = spline(x, y, x_smooth)
plt.plot(x_smooth, y_smooth)
plt.show()

However, when my program executes the line y_smooth = spline(x, y, x_smooth), it takes a very long time, say 10 minutes, and sometimes it exhausts my memory so badly that I have to restart my machine. I tried reducing the chunk number to 200 and 2000, and neither works. Then I checked the official scipy reference for scipy.interpolate.spline. It says that spline is deprecated as of v0.19, but I'm not using the new version. If spline has been deprecated for quite some time, how do I use the equivalent BSpline now? If spline still works, what causes the slow performance?

One portion of my data could look like this:

13.202      0.0
13.234738      -0.051354643759
12.999116      0.144464320836
12.86252      0.07396528119
13.1157      0.10019738758
13.357109      -0.30288563381
13.234004      -0.045792536285
12.836279      0.0362257166275
12.851597      0.0542649286915
13.110691      0.105297378401
13.220619      -0.0182963209185
13.092143      0.116647353635
12.545676      -0.641112204849
12.728248      -0.147460703493
12.874176      0.0755861585235
12.746764      -0.111583725833
13.024995      0.148079528382
13.106033      0.119481137144
13.327233      -0.197666132456
13.142423      0.0901867159545

Answer 1

Several issues here. First and foremost, the spline fitting you're trying to use is global. This means that you're solving a system of linear equations of size 20000 at construction time (evaluations are only weakly sensitive to the dataset size, though). This explains why the spline construction is slow.

Furthermore, scipy.interpolate.spline does linear algebra with full matrices, hence the memory consumption. This is precisely why it's deprecated from scipy 0.19.0 on.

The recommended replacement, available in scipy 0.19.0, is the BSpline / make_interp_spline combo:

>>> spl = make_interp_spline(x, y, k=3)    # returns a BSpline object
>>> y_new = spl(x_new)                     # evaluate

Notice it is not BSpline(x, y, k): BSpline objects do not know anything about the data, fitting, or interpolation.
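As a rough, self-contained sketch of that pattern (not part of the original answer): make_interp_spline expects x to be sorted and free of duplicates, so the hypothetical helper interp_sorted below first runs the data through np.unique. The helper name, the demo data, and the choice to keep the first y for each repeated x are all assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import make_interp_spline


def interp_sorted(x, y, num=200, k=3):
    """Sort by x, drop repeated x values, and fit an interpolating B-spline."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.unique returns the sorted unique x values and, with return_index,
    # the position of the first occurrence of each one in the input
    x_unique, first_idx = np.unique(x, return_index=True)
    y_unique = y[first_idx]
    spl = make_interp_spline(x_unique, y_unique, k=k)  # returns a BSpline
    x_new = np.linspace(x_unique[0], x_unique[-1], num)
    return x_new, spl(x_new)


# Made-up, unsorted demo data in roughly the same x range as the question
rng = np.random.default_rng(0)
x_demo = rng.uniform(12.5, 13.4, size=50)
y_demo = np.sin(5 * x_demo)
x_new, y_new = interp_sorted(x_demo, y_demo)
```

If repeated x values carry different y values, averaging them may be a better choice than keeping the first one; that depends on the data.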

If you are using older scipy versions, your options are:

CubicSpline(x, y) for cubic splines
the splrep(x, y, s=0) / splev combo
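Both options can be sketched as follows. This is an illustration on made-up data (x_demo, y_demo, and x_fine are assumed names), not code from the original answer; both calls need x in increasing order.

```python
import numpy as np
from scipy.interpolate import CubicSpline, splrep, splev

# Made-up demo data; x must be in increasing order for both approaches
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = np.sin(x_demo)
x_fine = np.linspace(0.0, 10.0, 500)

# Option 1: CubicSpline returns a callable object
cs = CubicSpline(x_demo, y_demo)
y_cs = cs(x_fine)

# Option 2: splrep builds the spline representation (s=0 means pure
# interpolation, no smoothing) and splev evaluates it
tck = splrep(x_demo, y_demo, s=0)
y_tck = splev(x_fine, tck)
```

Either result can be passed straight to plt.plot(x_fine, ...).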

However, you may want to consider whether you really need twice continuously differentiable functions. If functions that are only once differentiable are smooth enough for your purposes, you can use local spline interpolation, e.g. Akima1DInterpolator or PchipInterpolator:

In [1]: import numpy as np

In [2]: from scipy.interpolate import pchip, splmake

In [3]: x = np.arange(1000)

In [4]: y = x**2

In [5]: %timeit pchip(x, y)
10 loops, best of 3: 58.9 ms per loop

In [6]: %timeit splmake(x, y)    
1 loop, best of 3: 5.01 s per loop

Here splmake is what spline uses under the hood, and it's also deprecated.
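For completeness, here is a minimal sketch of how the two local interpolators might be evaluated on a finer grid. The demo arrays are made up for illustration; both classes expect x in increasing order.

```python
import numpy as np
from scipy.interpolate import Akima1DInterpolator, PchipInterpolator

# Made-up demo data with x in increasing order
x_demo = np.linspace(0.0, 10.0, 1000)
y_demo = np.sin(x_demo)
x_fine = np.linspace(0.0, 10.0, 5000)

# Both constructors return callables; construction is local, so the cost
# grows roughly linearly with the number of data points
y_pchip = PchipInterpolator(x_demo, y_demo)(x_fine)
y_akima = Akima1DInterpolator(x_demo, y_demo)(x_fine)
```

Note that these interpolate, i.e. pass through every data point, so noisy measurements stay noisy; they only avoid the cost of a global fit.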

Answer 2

Most interpolation methods in SciPy are function-generating, i.e. they return a function which you can then evaluate on your x data. For example, using CubicSpline, which connects all points with a piecewise cubic spline, would be:

from scipy.interpolate import CubicSpline

spline = CubicSpline(x, y)
y_smooth = spline(x_smooth)

Based on your description, I think you are right to want a B-spline. To do so, follow the pattern above, i.e.

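from scipy.interpolate import make_interp_spline

order = 2 # spline degree
# BSpline(t, c, k) takes knots and coefficients, not data points;
# make_interp_spline fits the data and returns a BSpline object
spline = make_interp_spline(x, y, k=order)
y_smooth = spline(x_smooth)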

With this much data, it is probably quite noisy. I'd suggest using a higher spline order, which relates to the number of knots used for interpolation.

In both cases, your knots, i.e. x and y, should be sorted. These are 1D interpolations (since you are using only x_smooth as input). You can sort them using np.argsort. In short:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline

# Sort both arrays by x (x must also contain no duplicate values)
sort_idx = np.argsort(x)
x_sorted = x[sort_idx]
y_sorted = y[sort_idx]

order = 5 # spline degree
spline = make_interp_spline(x_sorted, y_sorted, k=order)
y_smooth = spline(x_smooth)

plt.plot(x_sorted, y_sorted, '.')
plt.plot(x_smooth, y_smooth, '-')
plt.show()

Answer 3

My problem can be generalized to how to smoothly plot 2D graphs when the data points are randomly ordered. Since you are only dealing with two columns of data, if you sort your data by the independent variable, at least your data points will be connected in order, and that's how matplotlib connects your data points.

@Dawid Laszuk has provided one solution for sorting data by the independent variable, and I'll show mine here:

# Pair each x with its y, then sort the pairs by x
plotting_columns = []
for i in range(len(x)):
    plotting_columns.append(np.array([x[i], y[i]]))
plotting_columns.sort(key=lambda pair: pair[0])
plotting_columns = np.array(plotting_columns)

A traditional sort() with a key function does the sorting job efficiently here as well.

But that's just the first step. The following steps are not hard either. To smooth your graph, you also want your independent variable in linear ascending order with an identical step interval, so

x_smooth = np.linspace(x.min(), x.max(), num_steps)

is enough to do the job. Usually, if you have plenty of data points, for example more than 10000 (where correctness and accuracy are not human-verifiable), and you just want to plot the significant points to display the trend, then smoothing only x is enough. So you can simply plt.plot(x_smooth, y).

You will notice that x_smooth generates many x values that have no corresponding y value. When you want to maintain correctness, you need to use line-fitting functions. As @ev-br demonstrated in his answer, spline functions are expensive by design. Therefore you might want a simpler trick. I smoothed my graph without using those functions, and it takes only a few simple steps.

First, round your values so that your data does not vary too much over small intervals (you can skip this step). You can change one line when constructing plotting_columns to:

plotting_columns.append(np.around(np.array([x[i], y[i]]), decimals=4))

After doing this, you can filter out the points that you don't want to plot by choosing the points close to the x_smooth values:

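new_plots = []
error = 1e-3  # placeholder tolerance; choose a value that suits your data
for i in range(len(x_smooth)):
    x_i = plotting_columns[i, 0]
    # Keep the point only if it lies within `error` of the matching x_smooth value
    if x_smooth[i] - error <= x_i < x_smooth[i] + error:
        new_plots.append(plotting_columns[i])
    # Otherwise the point is dropped, i.e. it is not plotted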

This is how I solved my problem.
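The idea of keeping only the points close to the x_smooth values can also be written without an explicit loop. The following is one possible sketch, not the code used above: the helper name pick_near_grid, its tol parameter, and the demo arrays are hypothetical, and it assumes the data is already sorted by x and has at least two points.

```python
import numpy as np


def pick_near_grid(x_sorted, y_sorted, x_grid, tol):
    """For each grid value, keep the nearest data point if it lies within tol."""
    # Index of the first data point >= each grid value, clipped so that
    # both the left and the right neighbor indices are valid
    right = np.clip(np.searchsorted(x_sorted, x_grid), 1, len(x_sorted) - 1)
    left = right - 1
    # Choose whichever neighbor is closer to the grid value
    nearest = np.where(
        x_grid - x_sorted[left] <= x_sorted[right] - x_grid, left, right
    )
    keep = np.abs(x_sorted[nearest] - x_grid) <= tol
    idx = np.unique(nearest[keep])  # avoid selecting the same point twice
    return x_sorted[idx], y_sorted[idx]


# Tiny made-up example: five sorted points, a three-point grid
x_sorted = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
y_sorted = x_sorted ** 2
x_grid = np.linspace(0.0, 1.0, 3)
x_kept, y_kept = pick_near_grid(x_sorted, y_sorted, x_grid, tol=0.05)
```

Here the points at x = 0.1 and x = 0.9 are dropped because they are not the nearest neighbors of any grid value.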

Answer 1
3 Accepted 2017-04-27 06:06:51

Answer 2
1 2017-04-26 22:34:51

Answer 3
0 2017-04-28 16:39:45
