

Python: two-curve gaussian fitting with non-linear least-squares

My knowledge of maths is limited, which is probably why I am stuck. I have a spectrum to which I am trying to fit two Gaussian peaks. I can fit the largest peak, but I cannot fit the smallest peak. I understand that I need to sum the Gaussian function for the two peaks, but I do not know where I have gone wrong. An image of my current output is shown:


The blue line is my data and the green line is my current fit. There is a shoulder to the left of the main peak in my data, which I am currently trying to fit using the following code:

import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import leastsq
from pylab import *

time = []
counts = []

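for i in open('/some/folder/to/file.txt', 'r'):
    segs = i.split()
    # assumed file layout: column 0 = time, column 1 = counts
    time.append(float(segs[0]))
    counts.append(float(segs[1]))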

time_array = arange(len(time), dtype=float)
counts_array = arange(len(counts))
time_array[0:] = time
counts_array[0:] = counts

def model(time_array0, coeffs0):
    a = coeffs0[0] + coeffs0[1] * np.exp( - ((time_array0-coeffs0[2])/coeffs0[3])**2 )
    b = coeffs0[4] + coeffs0[5] * np.exp( - ((time_array0-coeffs0[6])/coeffs0[7])**2 ) 
    c = a+b
    return c

def residuals(coeffs, counts_array, time_array):
    return counts_array - model(time_array, coeffs)

# 0 = baseline, 1 = amplitude, 2 = centre, 3 = width
peak1 = np.array([0,6337,16.2,4.47,0,2300,13.5,2], dtype=float)
#peak2 = np.array([0,2300,13.5,2], dtype=float)

x, flag = leastsq(residuals, peak1, args=(counts_array, time_array))
#z, flag = leastsq(residuals, peak2, args=(counts_array, time_array))

plt.plot(time_array, counts_array)
plt.plot(time_array, model(time_array, x), color = 'g') 
#plt.plot(time_array, model(time_array, z), color = 'r')

This code worked for me, provided that you are only fitting a function that is a combination of two Gaussian distributions.

I just made a residuals function that adds two Gaussian functions and then subtracts their sum from the real data.

The parameters (p) that I passed to SciPy's leastsq function are: the mean of the first Gaussian function (m), the difference between the means of the first and second Gaussian functions (dm, i.e. the horizontal shift), the standard deviation of the first (sd1), and the standard deviation of the second (sd2).

import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt

# Setting up test data
def norm(x, mean, sd):
  norm = []
  for i in range(x.size):
    norm += [1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x[i] - mean)**2/(2*sd**2))]
  return np.array(norm)

mean1, mean2 = 0, -2
std1, std2 = 0.5, 1 

x = np.linspace(-20, 20, 500)
y_real = norm(x, mean1, std1) + norm(x, mean2, std2)

# Solving
m, dm, sd1, sd2 = [5, 10, 1, 1]
p = [m, dm, sd1, sd2] # Initial guesses for leastsq
y_init = norm(x, m, sd1) + norm(x, m + dm, sd2) # For final comparison plot

def res(p, y, x):
  m, dm, sd1, sd2 = p
  m1 = m
  m2 = m1 + dm
  y_fit = norm(x, m1, sd1) + norm(x, m2, sd2)
  err = y - y_fit
  return err

plsq = leastsq(res, p, args = (y_real, x))

y_est = norm(x, plsq[0][0], plsq[0][2]) + norm(x, plsq[0][0] + plsq[0][1], plsq[0][3])

plt.plot(x, y_real, label='Real Data')
plt.plot(x, y_init, 'r.', label='Starting Guess')
plt.plot(x, y_est, 'g.', label='Fitted')
plt.legend()
plt.show()
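
Since the result above is read straight out of plsq[0], it may be worth checking that leastsq actually reports convergence before trusting it. The following is only a sketch under assumptions: gauss is a vectorised stand-in for norm, the starting guess is chosen close to the true values, and the data are the same noise-free test curve. With full_output=True, leastsq also returns a status integer, and values 1 to 4 mean a solution was found.

```python
import numpy as np
from scipy.optimize import leastsq

def gauss(x, mean, sd):
    # Unit-area Gaussian, vectorised over x (stand-in for norm above).
    return np.exp(-(x - mean) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

def res(p, y, x):
    # Same parametrisation as above: m, dm, sd1, sd2.
    m, dm, sd1, sd2 = p
    return y - (gauss(x, m, sd1) + gauss(x, m + dm, sd2))

x = np.linspace(-20, 20, 500)
y_real = gauss(x, 0.0, 0.5) + gauss(x, -2.0, 1.0)

# Hypothetical starting guess, deliberately near the true parameters.
p0 = [0.2, -1.8, 0.6, 0.9]
p_fit, cov, info, mesg, ier = leastsq(res, p0, args=(y_real, x), full_output=True)

converged = ier in (1, 2, 3, 4)  # leastsq's documented success codes
print(converged, mesg)
print(p_fit)
```

If converged is False, mesg usually says why, and a better starting guess is the first thing to try.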


You can use Gaussian mixture models from scikit-learn:

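import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# yourdata: 1-D array of raw samples (a mixture model fits samples, not an (x, y) curve)
samples = np.asarray(yourdata, dtype=float).reshape(-1, 1)
# mixture.GMM was removed from scikit-learn; GaussianMixture replaces it and must be fitted
clf = GaussianMixture(n_components=2, covariance_type='full').fit(samples)
m1, m2 = clf.means_.ravel()
w1, w2 = clf.weights_
c1, c2 = clf.covariances_.ravel()  # variances of the two components
histdist = plt.hist(samples.ravel(), 100, density=True)
# scipy.stats.norm.pdf replaces the removed matplotlib.mlab.normpdf
plotgauss1 = lambda x: plt.plot(x, w1 * norm.pdf(x, m1, np.sqrt(c1)), linewidth=3)
plotgauss2 = lambda x: plt.plot(x, w2 * norm.pdf(x, m2, np.sqrt(c2)), linewidth=3)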


You can also use the function below to fit as many Gaussians as you want, set with the ncomp parameter:

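import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def fit_mixture(data, ncomp=2, doplot=False):
    # data: 1-D array of raw samples
    samples = np.asarray(data, dtype=float).reshape(-1, 1)
    clf = GaussianMixture(n_components=ncomp, covariance_type='full').fit(samples)
    ms = [m[0] for m in clf.means_]
    cs = [np.sqrt(c[0][0]) for c in clf.covariances_]  # standard deviations
    ws = list(clf.weights_)
    if doplot:
        histo = plt.hist(samples.ravel(), 200, density=True)
        for w, m, c in zip(ws, ms, cs):
            # c is already a standard deviation, so no second square root
            plt.plot(histo[1], w * norm.pdf(histo[1], m, c), linewidth=3)
    return ms, cs, ws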

Coeffs 0 and 4 are degenerate - there is absolutely nothing in the data that can decide between them. You should use a single zero-level parameter instead of two (i.e. remove one of them from your code). This is probably what is stopping your fit (ignore the comments here saying this is not possible - there are clearly at least two peaks in that data and you should certainly be able to fit to that).

(It may not be clear why I am suggesting this, but what is happening is that coeffs 0 and 4 can cancel each other out, because only their sum enters the model. They can both be zero, or one could be 100 and the other -100 - either way, the fit is just as good. This "confuses" the fitting routine, which spends its time trying to work out what they should be when there is no single right answer: whatever value one takes, the other can just be the negative of that, and the fit will be the same.)

In fact, from the plot, it looks like there may be no need for a zero level at all. I would try dropping both of those and seeing how the fit looks.
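
To make the single-baseline suggestion concrete, here is a minimal sketch, not the asker's actual data: the spectrum is a synthetic, noise-free stand-in and every number in true_coeffs and guess is hypothetical. The model keeps the question's peak shape but shares one baseline between the two Gaussians, so the parameter vector has seven entries instead of eight.

```python
import numpy as np
from scipy.optimize import leastsq

def model(t, coeffs):
    # One shared baseline plus two Gaussian peaks (same peak shape as the question).
    base, a1, c1, w1, a2, c2, w2 = coeffs
    return (base
            + a1 * np.exp(-((t - c1) / w1) ** 2)
            + a2 * np.exp(-((t - c2) / w2) ** 2))

def residuals(coeffs, counts, t):
    return counts - model(t, coeffs)

# Synthetic stand-in for the spectrum (hypothetical values, no noise).
t = np.linspace(0.0, 30.0, 301)
true_coeffs = np.array([50.0, 6000.0, 16.0, 4.0, 2000.0, 10.0, 2.0])
counts = model(t, true_coeffs)

# Starting guess: off, but in the right neighbourhood of each peak.
guess = np.array([0.0, 5500.0, 16.5, 4.5, 1500.0, 10.5, 2.5])
fit, flag = leastsq(residuals, guess, args=(counts, t))
print(fit)
```

Dropping the zero level altogether is the same change taken one step further: remove base from model and from the parameter vector.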

Also, there is no need to fit coeffs 1 and 5 (or the zero point) in the least squares. Instead, because the model is linear in those, you could calculate their values directly on each iteration. This will make things faster, but is not critical. I just noticed you say your maths is not so good, so probably ignore this one.
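
For readers who do want to try the linear trick, a rough sketch follows. It assumes the same peak shape as the question and uses synthetic, noise-free data with hypothetical values. leastsq only searches over the centres and widths. For each trial set, the baseline and the two amplitudes come from an ordinary linear least-squares solve (np.linalg.lstsq). The helper names basis and linear_part are made up for this illustration.

```python
import numpy as np
from scipy.optimize import leastsq

def basis(t, nonlin):
    # Columns: constant baseline and two unit-height Gaussians.
    c1, w1, c2, w2 = nonlin
    return np.column_stack([
        np.ones_like(t),
        np.exp(-((t - c1) / w1) ** 2),
        np.exp(-((t - c2) / w2) ** 2),
    ])

def linear_part(nonlin, counts, t):
    # Best baseline and amplitudes for fixed centres and widths.
    lin, *_ = np.linalg.lstsq(basis(t, nonlin), counts, rcond=None)
    return lin

def residuals(nonlin, counts, t):
    lin = linear_part(nonlin, counts, t)
    return counts - basis(t, nonlin) @ lin

# Synthetic stand-in for the spectrum (hypothetical values, no noise).
t = np.linspace(0.0, 30.0, 301)
true_nonlin = np.array([16.0, 4.0, 10.0, 2.0])   # centre1, width1, centre2, width2
true_lin = np.array([50.0, 6000.0, 2000.0])      # baseline, amplitude1, amplitude2
counts = basis(t, true_nonlin) @ true_lin

guess = np.array([16.5, 4.5, 10.5, 2.5])         # only four parameters to guess
nonlin_fit, flag = leastsq(residuals, guess, args=(counts, t))
lin_fit = linear_part(nonlin_fit, counts, t)
print(nonlin_fit, lin_fit)
```

A side benefit is that no starting guess is needed for the baseline or the amplitudes.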

