How to fit a smooth curve to my data in R?

I'm trying to draw a smooth curve in R. I have the following simple toy data:

> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y
 [1]  2  4  6  8  7 12 14 16 18 20

Now when I plot it with a standard command, it looks bumpy and edgy, of course:

> plot(x,y, type='l', lwd=2, col='red')

How can I make the curve smooth so that the 3 edges are rounded using estimated values? I know there are many methods to fit a smooth curve, but I'm not sure which one would be most appropriate for this type of curve and how you would write it in R.

I like loess() a lot for smoothing:

x <- 1:10
y <- c(2,4,6,8,7,12,14,16,18,20)
lo <- loess(y~x)
plot(x,y)
lines(x, predict(lo), col='red', lwd=2)

Venables and Ripley's MASS book has an entire section on smoothing that also covers splines and polynomials, but loess() is just about everybody's favourite.

Maybe smooth.spline is an option. You can set a smoothing parameter (typically between 0 and 1) here:
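To illustrate how the amount of smoothing in loess() can be tuned, here is a minimal sketch that compares the default span (0.75) with a wider one on the toy data from the question. The variable names (grid, fit_default, fit_wide) and the choice of span = 1 are only illustrative assumptions, not part of the answer above.

```r
# Toy data from the question
x <- 1:10
y <- c(2, 4, 6, 8, 7, 12, 14, 16, 18, 20)

# Fine grid of x values so the fitted curves are drawn smoothly
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))

fit_default <- loess(y ~ x)            # default span = 0.75
fit_wide    <- loess(y ~ x, span = 1)  # every point contributes to each local fit

pred_default <- predict(fit_default, newdata = grid)
pred_wide    <- predict(fit_wide, newdata = grid)

plot(x, y)
lines(grid$x, pred_default, col = "red", lwd = 2)
lines(grid$x, pred_wide, col = "blue", lwd = 2, lty = 2)
```

A larger span uses more neighbouring points in each local regression, so the curve reacts less to the single dip at x = 5.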

smoothingSpline = smooth.spline(x, y, spar=0.35)
plot(x,y)
lines(smoothingSpline)

You can also use predict on smooth.spline objects. The function comes with base R; see ?smooth.spline for details.

In order to get it REALLY smoooth...
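As a rough sketch of the predict route mentioned above, the spline can be evaluated on a finer grid of x values before drawing it. The grid size of 200 and the names fit, grid and pred are arbitrary choices made for this illustration.

```r
# Toy data from the question
x <- 1:10
y <- c(2, 4, 6, 8, 7, 12, 14, 16, 18, 20)

fit <- smooth.spline(x, y, spar = 0.35)

# predict() on a smooth.spline object returns a list with components x and y
grid <- seq(min(x), max(x), length.out = 200)
pred <- predict(fit, grid)

plot(x, y)
lines(pred$x, pred$y, col = "blue", lwd = 2)
```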

x <- 1:10
y <- c(2,4,6,8,7,8,14,16,18,20)
lo <- loess(y~x)
plot(x,y)
xl <- seq(min(x),max(x), (max(x) - min(x))/1000)
lines(xl, predict(lo,xl), col='red', lwd=2)

This style interpolates lots of extra points and gets you a curve that is very smooth. It also appears to be the approach that ggplot takes. If the standard level of smoothness is fine, you can just use:

scatter.smooth(x, y)
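If the curve itself is needed rather than just the plot, loess.smooth() from base R's stats package returns the points that scatter.smooth() draws. The following is only a sketch: the span and degree shown are that function's documented defaults, and evaluation = 200 is an arbitrary choice for a finer curve.

```r
# Toy data from the answer above
x <- 1:10
y <- c(2, 4, 6, 8, 7, 8, 14, 16, 18, 20)

# loess.smooth() returns a list with the x and y coordinates of the smooth
sm <- loess.smooth(x, y, span = 2/3, degree = 1, evaluation = 200)

plot(x, y)
lines(sm$x, sm$y, col = "red", lwd = 2)
```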

The qplot() function in the ggplot2 package is very simple to use and provides an elegant solution that includes confidence bands. (Note that qplot() is deprecated as of ggplot2 3.4.0; ggplot() with geom_smooth() is the current equivalent.) For instance,

qplot(x,y, geom='smooth', span =0.5)

produces a smoothed curve with its confidence band (image not included).

LOESS is a very good approach, as Dirk said.

Another option is using Bezier splines, which may in some cases work better than LOESS if you don't have many data points.

Here you'll find an example: http://rosettacode.org/wiki/Cubic_bezier_curves#R

# x, y: the x and y coordinates of the hull points
# n: the number of points in the curve.
bezierCurve <- function(x, y, n=10)
    {
    outx <- NULL
    outy <- NULL

    i <- 1
    for (t in seq(0, 1, length.out=n))
        {
        b <- bez(x, y, t)
        outx[i] <- b$x
        outy[i] <- b$y

        i <- i+1
        }

    return (list(x=outx, y=outy))
    }

bez <- function(x, y, t)
    {
    outx <- 0
    outy <- 0
    n <- length(x)-1
    for (i in 0:n)
        {
        outx <- outx + choose(n, i)*((1-t)^(n-i))*t^i*x[i+1]
        outy <- outy + choose(n, i)*((1-t)^(n-i))*t^i*y[i+1]
        }

    return (list(x=outx, y=outy))
    }

# Example usage
x <- c(4,6,4,5,6,7)
y <- 1:6
plot(x, y, "o", pch=20)
points(bezierCurve(x,y,20), type="l", col="red")
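The loops above evaluate the Bernstein polynomial one point at a time. As a sketch of the same computation in vectorised form, the basis can be built as a matrix and multiplied by the control points. The function name bezierVec is hypothetical and not part of the Rosetta Code example.

```r
# Vectorised Bezier curve: x, y are the control (hull) points,
# n is the number of points to evaluate along the curve.
bezierVec <- function(x, y, n = 10) {
  deg <- length(x) - 1
  t <- seq(0, 1, length.out = n)
  # Bernstein basis: one row per value of t, one column per control point
  B <- sapply(0:deg, function(i) choose(deg, i) * (1 - t)^(deg - i) * t^i)
  list(x = as.vector(B %*% x), y = as.vector(B %*% y))
}

# Same control points as the example usage above
cx <- c(4, 6, 4, 5, 6, 7)
cy <- 1:6
bz <- bezierVec(cx, cy, 20)

plot(cx, cy, type = "o", pch = 20)
lines(bz$x, bz$y, col = "red")
```

Note that a Bezier curve only passes through the first and last control points; the points in between pull the curve towards them without being interpolated.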

The other answers are all good approaches. However, there are a few other options in R that haven't been mentioned, including lowess and approx, which may give better fits or faster performance.

The advantages are more easily demonstrated with an alternate dataset:

sigmoid <- function(x)
{
  y<-1/(1+exp(-.15*(x-100)))
  return(y)
}

dat<-data.frame(x=rnorm(5000)*30+100)
dat$y<-as.numeric(as.logical(round(sigmoid(dat$x)+rnorm(5000)*.3,0)))

Here is the data overlaid with the sigmoid curve that generated it:

[Figure: the data overlaid with the generating sigmoid curve]

This sort of data is common when looking at a binary behavior among a population. For example, this might be a plot of whether or not a customer purchased something (a binary 1/0 on the y-axis) versus the amount of time they spent on the site (x-axis).

A large number of points is used to better demonstrate the performance differences of these functions.

The smooth, spline, and smooth.spline functions all produce gibberish on a dataset like this with any set of parameters I have tried, perhaps due to their tendency to map to every point, which does not work for noisy data.

The loess, lowess, and approx functions all produce usable results, although just barely for approx. This is the code for each using lightly optimized parameters:

loessFit <- loess(y~x, dat, span = 0.6)
loessFit <- data.frame(x=loessFit$x,y=loessFit$fitted)
loessFit <- loessFit[order(loessFit$x),]

approxFit <- approx(dat,n = 15)

lowessFit <-data.frame(lowess(dat,f = .6,iter=1))

And the results:

plot(dat,col='gray')
curve(sigmoid,0,200,add=TRUE,col='blue')
lines(lowessFit,col='red')
lines(loessFit,col='green')
lines(approxFit,col='purple')
legend(150,.6,
       legend=c("Sigmoid","Loess","Lowess",'Approx'),
       lty=1,
       lwd=2.5,col=c("blue","green","red","purple"))

[Figure: the fitted curves plotted over the data]

As you can see, lowess produces a near perfect fit to the original generating curve. Loess is close, but experiences a strange deviation at both tails.

Although your dataset will be very different, I have found that other datasets perform similarly, with both loess and lowess capable of producing good results. The differences become more significant when you look at benchmarks:
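For context on why approx behaves differently from the two regression smoothers: it does no local fitting at all, only linear interpolation between the supplied points, evaluated either at requested x values or at n equally spaced ones. A small sketch on the toy data from the original question (the names mid and coarse are illustrative only):

```r
# Toy data from the original question
x <- 1:10
y <- c(2, 4, 6, 8, 7, 12, 14, 16, 18, 20)

# Interpolate halfway between the points (4, 8) and (5, 7)
mid <- approx(x, y, xout = 4.5)$y
mid
# → 7.5

# With n = 4, approx evaluates at 4 equally spaced x values: 1, 4, 7, 10
coarse <- approx(x, y, n = 4)
coarse$y
# → 2 8 14 20
```

A small n therefore acts as a crude smoother simply because most of the data is skipped, which may explain why the result is only barely usable on noisy data.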

> microbenchmark::microbenchmark(loess(y~x, dat, span = 0.6),approx(dat,n = 20),lowess(dat,f = .6,iter=1),times=20)
Unit: milliseconds
                           expr        min         lq       mean     median        uq        max neval cld
  loess(y ~ x, dat, span = 0.6) 153.034810 154.450750 156.794257 156.004357 159.23183 163.117746    20   c
            approx(dat, n = 20)   1.297685   1.346773   1.689133   1.441823   1.86018   4.281735    20 a  
 lowess(dat, f = 0.6, iter = 1)   9.637583  10.085613  11.270911  11.350722  12.33046  12.495343    20  b 

Loess is extremely slow, taking roughly 100x as long as approx (median 156 ms versus 1.4 ms). Lowess produces better results than approx, while still running fairly quickly (roughly 14x faster than loess by median time).

Loess also becomes increasingly bogged down as the number of points increases, becoming unusable around 50,000.

EDIT: Additional research shows that loess gives better fits for certain datasets. If you are dealing with a small dataset or performance is not a consideration, try both functions and compare the results.

In ggplot2 you can do smooths in a number of ways, for example:
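One way to try both functions and compare them, using only base R, might look like the sketch below. It reuses the data-generating code from this answer with a smaller sample (n = 2000) and a fixed seed, both of which are arbitrary choices for illustration. The rmse helper is hypothetical, and distance from the generating sigmoid is just one possible yardstick.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

sigmoid <- function(x) 1 / (1 + exp(-0.15 * (x - 100)))

n <- 2000     # smaller than the 5000 used above, to keep the run short
dat <- data.frame(x = rnorm(n) * 30 + 100)
dat$y <- as.numeric(as.logical(round(sigmoid(dat$x) + rnorm(n) * 0.3, 0)))

# Time each fit with system.time() instead of the microbenchmark package
time_loess  <- system.time(lo <- loess(y ~ x, dat, span = 0.6))[["elapsed"]]
time_lowess <- system.time(lw <- lowess(dat, f = 0.6, iter = 1))[["elapsed"]]

# Root mean squared distance between the fitted values and the sigmoid
rmse <- function(fit, target) sqrt(mean((fit - target)^2))
rmse_loess  <- rmse(lo$fitted, sigmoid(dat$x))  # fitted values in data order
rmse_lowess <- rmse(lw$y, sigmoid(lw$x))        # lowess returns points sorted by x

print(c(loess = time_loess, lowess = time_lowess))
print(c(loess = rmse_loess, lowess = rmse_lowess))
```

Timings and errors will vary by machine and by seed, so the printed numbers are only a starting point for comparison on your own data.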

library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point() +
  geom_smooth(method = "gam", formula = y ~ poly(x, 2)) 
ggplot(mtcars, aes(wt, mpg)) + geom_point() +
  geom_smooth(method = "loess", span = 0.3, se = FALSE) 

[Figures: the two plots produced by the code above]

I didn't see this method shown, so if someone else is looking to do this, I found that the ggplot documentation suggested a technique for using the gam method that produced similar results to loess when working with small data sets.

library(ggplot2)
x <- 1:10
y <- c(2,4,6,8,7,8,14,16,18,20)

df <- data.frame(x,y)
r <- ggplot(df, aes(x = x, y = y)) + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))+geom_point()
r

[Figures: first, the loess method with the automatic formula; second, the gam method with the suggested formula]

Another option is using the ggscatter function from the ggpubr package. By specifying add="loess", you will get a smoothed line through your data. In the link above you can find more possibilities with this function. Here is a reproducible example using the mtcars dataset:
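The same kind of smooth can also be fitted without ggplot2 by calling the mgcv package directly, which is what geom_smooth(method = "gam") relies on. The sketch below assumes mgcv is installed (it normally ships with R as a recommended package). The use of method = "REML", the grid size, and the approximate band of two standard errors are assumptions of this illustration.

```r
library(mgcv)

# Same toy data as in the ggplot2 example above
df <- data.frame(x = 1:10, y = c(2, 4, 6, 8, 7, 8, 14, 16, 18, 20))

# Penalised cubic regression spline with shrinkage (bs = "cs")
fit <- gam(y ~ s(x, bs = "cs"), data = df, method = "REML")

# Predict on a fine grid, with standard errors for a rough confidence band
grid <- data.frame(x = seq(1, 10, length.out = 100))
pred <- predict(fit, newdata = grid, se.fit = TRUE)

plot(df$x, df$y, xlab = "x", ylab = "y")
lines(grid$x, pred$fit, col = "blue", lwd = 2)
lines(grid$x, pred$fit + 2 * pred$se.fit, lty = 2)
lines(grid$x, pred$fit - 2 * pred$se.fit, lty = 2)
```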

library(ggpubr)
ggscatter(data = mtcars,
          x = "wt",
          y = "mpg",
          add = "loess",
          conf.int = TRUE)
#> `geom_smooth()` using formula 'y ~ x'

Created on 2022-08-28 with reprex v2.0.2

