Performing multiple curve_fit iterations at once for a piecewise function

Question

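I am trying to run several iterations of SciPy's curve_fit at once, in order to avoid a loop and thereby gain speed.

This is very similar to this question, which has already been solved. However, the function here is piecewise (discontinuous), which makes that solution not applicable.

Consider this example: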

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
N=20
X=np.logspace(-1,1,N)
Y = np.zeros((4, N))
for i in range(0,4):
    b = i+1
    a = b
    print(a,b)
    Y[i] = (X/b)**(-a)  # + 0.01 * rng.randn(N) to add noise
    Y[i, X>b] = 1

This produces these arrays:

You can see the discontinuity at X==b. I can recover the original values of a and b by calling curve_fit iteratively:

def plaw(r, a, b):
    """ Theoretical power law for the shape of the normalized conditional density """
    import numpy as np
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])


coeffs=[]
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(plaw, X, Y[ix])
    coeffs.append(c0)

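However, this process can be very slow depending on the sizes of X, Y and the loop, so I am trying to speed things up by obtaining coeffs without a loop. So far I have had no luck.

Things that might be important:

X and Y contain only positive values
a and b are always positive
Although the data fitted in this example is smooth (for simplicity), the real data has noise

Edit

This is what I have so far: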

y=np.ma.masked_where(Y<1.01, Y)

lX = np.log(X)
lY = np.log(y)
A = np.vstack([lX, np.ones(len(lX))]).T
m,c=np.linalg.lstsq(A, lY.T)[0]

print('a=',-m)
print('b=',np.exp(-c/m))

But even without any noise, the output is:

a= [0.18978965578339158 1.1353633705997466 2.220234483915197 3.3324502660995714]
b= [339.4090881838179 7.95073481873057 6.296592007396107 6.402567167503574]

This is much worse than what I was hoping to get.

Answer 1

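Here are three approaches to speeding this up. You gave no desired speed-up or accuracy, or even vector sizes, so buyer beware.

TL;DR

Timings (seconds for 10 runs, by data length and method):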

len       1      2      3      4
1000    0.045  0.033  0.025  0.022
10000   0.290  0.097  0.029  0.023
100000  3.429  0.767  0.083  0.030
1000000               0.546  0.046

1) Original Method
2) Pre-estimate with Subset
3) M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767)
4) Subset Estimate (Use Less Data)

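Pre-estimate with a subset (Method 2):

A decent speed-up can be achieved by simply running curve_fit twice, where the first run uses a short subset of the data to get a quick estimate. That estimate is then used to seed curve_fit on the entire dataset.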

def method2():
    x, y = current_data
    # coarse fit on roughly 200 evenly strided points
    stride = max(1, len(x) // 200)
    c0 = curve_fit(power_law, x[::stride], y[::stride])[0]
    # seed the full fit with the coarse estimate
    return curve_fit(power_law, x, y, p0=c0)[0]
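
For anyone who wants to try this outside the test harness, here is a minimal self-contained sketch of the same two-stage idea. The helper name `two_stage_fit` and its `n_coarse` parameter are illustrative choices of mine, not part of the original code; the synthetic data mirror the sample data used in the test code further down (a = b = 3, noise scale 0.01).

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(r, a, b):
    """Piecewise power law: (r/b)**-a below the break b, 1 at and above it."""
    return np.piecewise(
        r, [r < b, r >= b], [lambda x: (x / b) ** -a, lambda x: 1])


def two_stage_fit(x, y, n_coarse=200):
    """Hypothetical helper: a coarse fit on ~n_coarse points seeds the full fit."""
    stride = max(1, len(x) // n_coarse)
    c0, _ = curve_fit(power_law, x[::stride], y[::stride])
    popt, _ = curve_fit(power_law, x, y, p0=c0)
    return popt


# Synthetic data in the style of the test code (a = b = 3).
np.random.seed(0)
n = 10000
x = np.logspace(-1, 1, n)
y = np.clip((x / 3.0) ** -3.0 + 0.01 * np.random.randn(n), 0.001, None)
y[x > 3.0] = 1.0

a_fit, b_fit = two_stage_fit(x, y)
print(a_fit, b_fit)
```

The second call usually needs only a few iterations because it starts near the optimum, which is presumably where the speed-up in the timing table comes from.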

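M Newville's linear log-log estimate (Method 3):

Using the log-log estimate proposed by M Newville is also considerably faster. Since the OP was concerned about the initial-estimate method Newville proposed, this method uses curve_fit on a subset to provide an estimate of the break point in the curve.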

def method3():
    x, y = current_data
    stride = max(1, len(x) // 200)
    c0 = curve_fit(power_law, x[::stride], y[::stride])[0]

    # keep only the points below the estimated break point c0[1]
    index_max = np.where(x > c0[1])[0][0]
    log_x = np.log(x[:index_max])
    log_y = np.log(y[:index_max])
    result = linregress(log_x, log_y)
    # slope is -a, intercept is a * log(b)
    m, c = -result[0], np.exp(-result[1] / result[0])
    return (m, c), result
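
The regression step can also be tried in isolation. The sketch below assumes you already have a rough break-point guess from somewhere; the name `loglog_estimate` and the `b_guess` argument are hypothetical and only serve the illustration. It uses noise-free data so the line fit is exact.

```python
import numpy as np
from scipy.stats import linregress


def loglog_estimate(x, y, b_guess):
    """Estimate (a, b) from the points with x < b_guess via a log-log line fit.

    Assumes y > 0 on the selected points.
    """
    keep = x < b_guess
    res = linregress(np.log(x[keep]), np.log(y[keep]))
    a = -res.slope                  # slope of log(y) vs log(x) is -a
    b = np.exp(res.intercept / a)   # intercept is a * log(b)
    return a, b


# Noise-free piecewise power law with a = b = 3.
x = np.logspace(-1, 1, 1000)
y = np.where(x < 3.0, (x / 3.0) ** -3.0, 1.0)

a_est, b_est = loglog_estimate(x, y, b_guess=2.5)
print(a_est, b_est)
```

A guess that is somewhat too low only discards a few usable points, while a guess that is too high pulls flat points (y = 1) into the regression and biases both parameters, so it seems safer to err on the low side.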

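Use less data (Method 4):

Finally, the seeding mechanism used in the previous two methods already provides a very good estimate for the sample data. Of course this is sample data, so your mileage may vary.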

x, y = current_data
stride = max(1, len(x) // 200)
c0 = curve_fit(power_law, x[::stride], y[::stride])[0]
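
One way to judge whether the subset estimate is good enough for your own data is to look at the covariance matrix that curve_fit returns alongside the parameters. The following is a sketch under the same assumptions as the sample data (a = b = 3); the helper name `coarse_fit` and its `n_coarse` parameter are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(r, a, b):
    """Piecewise power law: (r/b)**-a below the break b, 1 at and above it."""
    return np.piecewise(
        r, [r < b, r >= b], [lambda x: (x / b) ** -a, lambda x: 1])


def coarse_fit(x, y, n_coarse=200):
    """Hypothetical helper: fit a strided subset, return estimates and 1-sigma errors."""
    stride = max(1, len(x) // n_coarse)
    popt, pcov = curve_fit(power_law, x[::stride], y[::stride])
    perr = np.sqrt(np.diag(pcov))  # standard errors from the covariance diagonal
    return popt, perr


# Synthetic data in the style of the test code (a = b = 3).
np.random.seed(0)
n = 10000
x = np.logspace(-1, 1, n)
y = np.clip((x / 3.0) ** -3.0 + 0.01 * np.random.randn(n), 0.001, None)
y[x > 3.0] = 1.0

popt, perr = coarse_fit(x, y)
print(popt, perr)
```

If the reported errors are larger than you can tolerate, a denser subset (a larger `n_coarse`) or the full fit from Method 2 is probably the better choice.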

Test code:

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
from scipy.stats import linregress

fit_data = {}
current_data = None

def data_for_fit(a, b, n):
    key = a, b, n
    if key not in fit_data:
        rng.seed(0)
        x = np.logspace(-1, 1, n)
        y = np.clip((x / b) ** (-a) + 0.01 * rng.randn(n), 0.001, None)
        y[x > b] = 1
        fit_data[key] = x, y
    return fit_data[key]


def power_law(r, a, b):
    """ Power law for the shape of the normalized conditional density """
    import numpy as np
    return np.piecewise(
        r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

def method1():
    x, y = current_data
    # return (popt, pcov) like the other methods, so runit's [0] prints both parameters
    return curve_fit(power_law, x, y)

def method2():
    x, y = current_data
    return curve_fit(power_law, x, y, p0=method4()[0])

def method3():
    x, y = current_data
    c0, pcov = method4()

    index_max = np.where(x > c0[1])[0][0]
    log_x = np.log(x[:index_max])
    log_y = np.log(y[:index_max])
    result = linregress(log_x, log_y)
    m, c = -result[0], np.exp(-result[1] / result[0])
    return (m, c), result

def method4():
    x, y = current_data
    stride = int(max(1, len(x) / 200))
    return curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])

from timeit import timeit

def runit(stmt):
    print("%s: %.3f  %s" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt),
        eval(stmt + '()')[0]
    ))

def runit_size(size):

    print('Length: %d' % size)
    if size <= 100000:
        runit('method1')
        runit('method2')
    runit('method3')
    runit('method4')


for i in (1000, 10000, 100000, 1000000):
    current_data = data_for_fit(3, 3, i)
    runit_size(i)

Answer 2

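Two suggestions:

Use numpy.where (and possibly argmin) to find the X value at which the Y data becomes 1, or just slightly larger than 1, and truncate the data at that point, effectively ignoring the data where Y = 1.

That might look something like this: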

import numpy

# first index at which y has dropped to (nearly) 1
index_max = numpy.where(y < 1.2)[0][0]
x = x[:index_max]
y = y[:index_max]

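Using the hint shown in the log-log plot, the power law is now linear in log-log. You do not need curve_fit, but can use scipy.stats.linregress on log(Y) vs log(X). For your real work, this will at least give good starting values for a subsequent fit.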
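
To spell out why the line fit recovers both parameters, here is the short derivation, writing m for the fitted slope and c for the fitted intercept (the same symbols as in the question's edit):

```latex
\begin{aligned}
y &= \left(\frac{x}{b}\right)^{-a}, \qquad x < b \\
\ln y &= -a \ln x + a \ln b \\
m &= -a, \qquad c = a \ln b \\
a &= -m, \qquad b = \exp\!\left(\frac{c}{a}\right) = \exp\!\left(-\frac{c}{m}\right)
\end{aligned}
```

This only holds on the part of the data with x < b, which is why the flat part has to be cut off before the regression.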

Following up on this and trying to address your question, you could try something like this:

import numpy as np 
from scipy.stats import linregress

np.random.seed(0)
npts = 51 
x = np.logspace(-2, 2, npts)
YTHRESH = 1.02

for i in range(5):
    b = i + 1.0 + np.random.normal(scale=0.1)
    a = b + np.random.random()
    y = (x/b)**(-a) + np.random.normal(scale=0.0030, size=npts)
    y[x>b] = 1.0

    # to model the power-law decay, first remove the values
    # where y ~= 1 where the data is known to not decay...
    imax = np.where(y < YTHRESH)[0][0]

    # take log of this truncated x and y
    _x = np.log(x[:imax])
    _y = np.log(y[:imax])

    # use linear regression on the log-log data:
    out = linregress(_x, _y)

    # map slope/intercept to scale, exponent
    afit = -out.slope
    bfit = np.exp(out.intercept/afit)

    print(""" === Fit Example {i:3d}
          a  expected {a:4f}, got {afit:4f}
          b  expected {b:4f}, got {bfit:4f}
          """.format(i=i+1, a=a, b=b, afit=afit, bfit=bfit))

Hope that is enough to get you going.
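
If the remaining Python loop is still too slow, the same log-log regression can be written in closed form for all rows at once, using 0/1 weights instead of truncation. This is only a sketch: the function name `fit_rows_loglog` and the `ythresh` default are my own assumptions, and it requires every row to have at least two points above the threshold. As far as I can tell, np.linalg.lstsq works on the underlying data and does not honor the mask of a masked array, which would explain the poor numbers in the question's edit; explicit weights avoid that.

```python
import numpy as np


def fit_rows_loglog(X, Y, ythresh=1.02):
    """Weighted closed-form line fit of log(Y) vs log(X) for every row of Y."""
    w = (Y > ythresh).astype(float)        # 1 on the power-law part, 0 on the flat part
    lx = np.log(X)[None, :]                # shape (1, N), broadcasts over the rows
    ly = np.log(np.where(w > 0, Y, 1.0))   # excluded entries contribute log(1) = 0
    n = w.sum(axis=1)
    sx = (w * lx).sum(axis=1)
    sy = (w * ly).sum(axis=1)
    sxx = (w * lx * lx).sum(axis=1)
    sxy = (w * lx * ly).sum(axis=1)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    a = -slope                             # slope is -a
    b = np.exp(intercept / a)              # intercept is a * log(b)
    return a, b


# Noise-free data as in the question: a = b = 1, 2, 3, 4.
X = np.logspace(-1, 1, 20)
params = np.arange(1.0, 5.0)
Y = np.where(X[None, :] < params[:, None],
             (X[None, :] / params[:, None]) ** -params[:, None], 1.0)

a_fit, b_fit = fit_rows_loglog(X, Y)
print(a_fit)
print(b_fit)
```

With noisy data, a threshold this close to 1 may let a few flat points slip in, so treat the result as starting values for a proper fit rather than as the final answer.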

Answer 1: 2017-07-21 02:46:38
Answer 2: 2017-07-07 15:53:35