I've been trying to fit a curve in R but have run into some issues. I am working with several large data sets of x and y coordinates. When plotted with ggplot's geom_point or any other plotting function, the points tend to follow the shape of a square root function.
This is the code I used to make the fit with geom_smooth:
plt = ggplot(data = data2, aes(x = x, y = y)) + geom_point() + geom_smooth()
And that basically gets me this:
Is there a way to make the curve more like the red square root curve (y = x^0.5), that is, to make it smoother and have it follow a given formula? This is the smallest of the data sets, used here as an example.
I've also tried fitting with method = "loess", which gives a curve close to what I want. However, for data sets that are much larger (around 500,000-700,000 points) or that have points densely packed in one region, loess does not work as well. The fitted mean tends to be skewed there, which makes sense since the large number of points in that region pulls it up. But I need to fit the curve and force it to stay close to the square root curve. I've also tried changing the span value, but that didn't really affect the smoothness of the curve.
One thing that came to mind is the following. Your best graph is probably evaluated by minimizing a chi-square. You could add a further criterion to that, namely by how much the fit deviates from square root behaviour. This can be done by fitting the solution with sqrt() and adding a weighted chi-square for that fit to the total evaluation of the fit quality. I'm not sure how to do that in R, but in Python you get something like this: The blue graph is the best sqrt() fit. The yellow one is the best quadratic spline with knots at [0,0,.1,.2,.3,.4,.6,.9,.9,.9], i.e. weight=0 (you could additionally optimize the knot positions; I didn't do that here). Then we put increasing weight on how well the spline can be fitted by sqrt(), with weights = 0.5, 1, 2, respectively.
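Written out, the combined objective looks roughly like this. This is a sketch read off the description above; the symbols S_c (the spline with coefficients c), w (the weight) and a^* (the coefficient of the best sqrt() fit to the spline values) are my own notation. The weight appears squared because the least-squares routine squares the weighted residuals.

```latex
\chi^2_{\text{total}}(c)
  = \sum_i \bigl(y_i - S_c(x_i)\bigr)^2
  + w^2 \sum_i \bigl(a^*(c)\,\sqrt{x_i} - S_c(x_i)\bigr)^2,
\qquad
a^*(c) = \frac{\sum_i \sqrt{x_i}\, S_c(x_i)}{\sum_i x_i}
```

With w = 0 the second term vanishes and this reduces to the plain spline fit.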
Code is as follows:
import matplotlib
matplotlib.use('Qt4Agg')  # backend on my machine; remove or change this line if it is not available
from matplotlib import pyplot as plt
import numpy as np
from scipy.optimize import leastsq, curve_fit


### from the scipy doc page, as I have scipy 0.16 and no built-in BSpline yet
def B(x, k, i, t):
    # Cox-de Boor recursion for the i-th B-spline basis function of degree k
    if k == 0:
        return 1.0 if t[i] <= x < t[i+1] else 0.0
    if t[i+k] == t[i]:
        c1 = 0.0
    else:
        c1 = (x - t[i])/(t[i+k] - t[i]) * B(x, k-1, i, t)
    if t[i+k+1] == t[i+1]:
        c2 = 0.0
    else:
        c2 = (t[i+k+1] - x)/(t[i+k+1] - t[i+1]) * B(x, k-1, i+1, t)
    return c1 + c2


def bspline(x, t, c, k):
    n = len(t) - k - 1
    assert (n >= k+1) and (len(c) >= n)
    return sum(c[i] * B(x, k, i, t) for i in range(n))


def sqrtfunc(x, a):
    return a*np.sqrt(x)


def mixed_res(params, points, weight):
    [xList, yList] = zip(*points)
    bSplList = [bspline(x, [0,0,.1,.2,.3,.4,.6,.9,.9,.9], params, 2) for x in xList]
    ### standard chisq
    diffTrue = [y-b for y, b in zip(yList, bSplList)]
    ### how well can the spline be fitted with sqrt
    locfit, _ = curve_fit(sqrtfunc, xList, bSplList)
    sqrtList = [sqrtfunc(x, locfit[0]) for x in xList]
    diffWeight = [weight*(s-b) for s, b in zip(sqrtList, bSplList)]
    return diffTrue + diffWeight


xList, yList = np.loadtxt("PHOQSTACK.csv", unpack=True, delimiter=',')
xListSorted = sorted(xList)
# list(...) so the pairs can be reused on every call (zip is a one-shot iterator in Python 3)
zipData = list(zip(xList, yList))

fig = plt.figure(1)
ax = fig.add_subplot(1, 1, 1)

knotList = [0,0,.1,.2,.3,.4,.6,.9,.9,.9]
order = 2

sqrtvalues, _ = curve_fit(sqrtfunc, xList, yList)
th_sqrt_y = [sqrtfunc(x, sqrtvalues[0]) for x in xListSorted]
ax.scatter(xList, yList, s=1)
ax.plot(xListSorted, th_sqrt_y)

fitVals = [.2,.3,.4,.2,.3,.4,.2]
for s in [0, .5, 1, 2]:
    print(s)
    fitVals, ier = leastsq(mixed_res, fitVals, args=(zipData, s))
    th_b_y = [bspline(x, knotList, fitVals, order) for x in xListSorted]
    ax.plot(xListSorted, th_b_y)

plt.show()
The problem is that for large weights, the fit is busier forcing the shape towards sqrt than fitting the actual data, and you might run into convergence issues.
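One possible way around the convergence issue for this first variant: the spline is linear in its coefficients, and the best a for a*sqrt(x) is a linear function of the spline values, so the whole penalized problem can be stacked into a single linear least-squares solve. The following is only a minimal sketch under some assumptions: it uses synthetic stand-in data rather than the data from the question, it needs a newer SciPy that ships scipy.interpolate.BSpline, and it uses a clamped knot vector (triple end knots), which differs slightly from the knot list above. The helper names spline_basis and penalized_spline_fit are made up for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline


def spline_basis(x, knots, degree):
    """Return the (len(x), n_coef) matrix of B-spline basis values."""
    n_coef = len(knots) - degree - 1
    # Identity coefficients: column j of the result is the j-th basis function.
    return BSpline(knots, np.eye(n_coef), degree)(x)


def penalized_spline_fit(x, y, knots, degree, weight):
    """Spline fit with a penalty on the deviation from its own best a*sqrt(x)."""
    basis = spline_basis(x, knots, degree)
    q = np.sqrt(x)
    # Spline values minus their best a*sqrt(x) fit, as a linear map of the coefficients.
    dev = basis - np.outer(q, q @ basis) / (q @ q)
    design = np.vstack([basis, weight * dev])
    target = np.concatenate([y, np.zeros_like(y)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, basis @ coef


# Synthetic stand-in data (an assumption, not the data from the question).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 0.9, 2000))
y = 0.8 * np.sqrt(x) + rng.normal(0.0, 0.05, x.size)

# Clamped knots (assumption: triple end knots, unlike the knot list above).
knots = np.array([0, 0, 0, .1, .2, .3, .4, .6, .9, .9, .9])
degree = 2

q = np.sqrt(x)
deviations = []
for w in [0, 0.5, 1, 2]:
    coef, fitted = penalized_spline_fit(x, y, knots, degree, w)
    a = (q @ fitted) / (q @ q)  # best sqrt coefficient for this spline
    deviations.append(np.sqrt(np.mean((fitted - a * q) ** 2)))
    print(w, deviations[-1])  # RMS deviation of the spline from a*sqrt(x)
```

Because no iterative optimizer is involved, there is no starting guess to tune, and the design matrix has only a handful of columns, so this should also scale to a few hundred thousand points. It does not answer which weight to pick.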
A second option would be to make the sqrt term directly part of the fit and include its relative contribution in the chi-square. The blue and yellow graphs are as before. The others are weighted fits with the same weights as above.
For this I changed the residual function to
def mixed_res(params, points, weight):
    a = params[0]
    coffs = params[1:]
    [xList, yList] = zip(*points)
    sqrtList = [a*np.sqrt(x) for x in xList]
    bSplList = [bspline(x, [0,0,.1,.2,.3,.4,.6,.9,.9,.9], coffs, 2) for x in xList]
    diffTrue = [y-s-b for y, s, b in zip(yList, sqrtList, bSplList)]
    diffWeight = [weight*(s-b)/(s+.001) for s, b in zip(sqrtList, bSplList)]
    return diffTrue + diffWeight
and the call to fit as
fitVals = [.4] + [.2,.3,.4,.2,.3,.4,.4]
for s in [0, .5, 1, 2]:
    print(s)
    fitVals, ier = leastsq(mixed_res, fitVals, args=(zipData, s))
    th_b_y = [fitVals[0]*np.sqrt(x) + bspline(x, knotList, fitVals[1:], order) for x in xListSorted]
    ax.plot(xListSorted, th_b_y)
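For reference, the modified residual function corresponds to the following objective. This is read directly off the code above; the notation S_c for the spline and w for the weight is mine, and 0.001 is the small offset from the code, presumably there to avoid dividing by zero at x = 0.

```latex
\chi^2_{\text{total}}(a, c)
  = \sum_i \bigl(y_i - a\sqrt{x_i} - S_c(x_i)\bigr)^2
  + w^2 \sum_i \left(\frac{a\sqrt{x_i} - S_c(x_i)}{a\sqrt{x_i} + 0.001}\right)^2
```

Here a is a free fit parameter alongside the spline coefficients, so the problem is no longer linear in the parameters.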
The remaining big question is: how do you decide which weighting to take? What do you mean by "more like a square root"?