
Difference in 2D KDE produced using kde2d (R) and ksdensity2d (Matlab)

While trying to port some code from Matlab to R I have run into a problem. The gist of the code is to produce a 2D kernel density estimate and then do some simple calculations using the estimate. In Matlab the KDE calculation was done using the function ksdensity2d.m. In R the KDE calculation is done with kde2d from the MASS package. So let's say I want to calculate the KDE and just add up the values (this is not what I intend to do, but it serves the purpose here). In R this can be done by

    library(MASS)
    set.seed(1009)
    x <- sample(seq(1000, 2000), 100, replace=TRUE)
    y <- sample(seq(-12, 12), 100, replace=TRUE)
    kk <- kde2d(x, y, h=c(30, 1.5), n=100, lims=c(1000, 2000, -12, 12))
    sum(kk$z)

which gives the answer 0.3932732. When using ksdensity2d in Matlab with exactly the same data and conditions, the answer is 0.3768. Looking at the code for kde2d, I noticed that the bandwidth is divided by 4:

    kde2d <- function (x, y, h, n = 25, lims = c(range(x), range(y)))
    {
        nx <- length(x)
        if (length(y) != nx)
            stop("data vectors must be the same length")
        if (any(!is.finite(x)) || any(!is.finite(y)))
            stop("missing or infinite values in the data are not allowed")
        if (any(!is.finite(lims)))
            stop("only finite values are allowed in 'lims'")
        n <- rep(n, length.out = 2L)
        gx <- seq.int(lims[1L], lims[2L], length.out = n[1L])
        gy <- seq.int(lims[3L], lims[4L], length.out = n[2L])
        h <- if (missing(h))
            c(bandwidth.nrd(x), bandwidth.nrd(y))
        else rep(h, length.out = 2L)
        if (any(h <= 0))
            stop("bandwidths must be strictly positive")
        h <- h/4
        ax <- outer(gx, x, "-")/h[1L]
        ay <- outer(gy, y, "-")/h[2L]
        z <- tcrossprod(matrix(dnorm(ax), , nx), matrix(dnorm(ay),
            , nx))/(nx * h[1L] * h[2L])
        list(x = gx, y = gy, z = z)
    }
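
Written out, the estimate this function computes at a grid point $(g_x, g_y)$ appears to be the following. This is only a restatement of the code above, where $\phi$ is the standard normal density (`dnorm`), $n$ is `nx`, and $h_x$, $h_y$ are the two values passed in as `h`:

$$
\hat f(g_x, g_y) = \frac{1}{n\,(h_x/4)\,(h_y/4)} \sum_{i=1}^{n} \phi\!\left(\frac{g_x - x_i}{h_x/4}\right) \phi\!\left(\frac{g_y - y_i}{h_y/4}\right)
$$

So a bandwidth of $h$ passed to kde2d corresponds to a Gaussian kernel with standard deviation $h/4$ in that direction.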

A simple check of whether the difference in bandwidth explains the difference in the results is then

    kk <- kde2d(x, y, h=c(30, 1.5)*4, n=100, lims=c(1000, 2000, -12, 12))
    sum(kk$z)

which gives 0.3768013 (the same as the Matlab answer).

So my question is then: why does kde2d divide the bandwidth by four? (Or why doesn't ksdensity2d?)

At the mirrored GitHub source, lines 31-35:

    if (any(h <= 0))
        stop("bandwidths must be strictly positive")
    h <- h/4                            # for S's bandwidth scale
    ax <- outer(gx, x, "-" )/h[1L]
    ay <- outer(gy, y, "-" )/h[2L]

and the help file for kde2d(), which suggests looking at the help file for bandwidth. That says:

...which are all scaled to the width argument of density and so give answers four times as large.

But why?

density() says that the width argument exists for the sake of compatibility with S (the precursor to R). The comments in the source for density() read:

    ## S has width equal to the length of the support of the kernel
    ## except for the gaussian where it is 4 * sd.
    ## R has bw a multiple of the sd.

The default kernel is the Gaussian one. When the bw argument is unspecified and width is given, width is converted and substituted in (for the Gaussian kernel, bw = width/4), e.g.

    library(MASS)

    set.seed(1)
    x <- rnorm(1000, 10, 2)
    all.equal(density(x, bw = 1), density(x, width = 4)) # Only the call is different
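
To see what "only the call is different" means in practice, one can compare the bandwidth actually used and the estimated density values directly. This is a small sketch; the names `d_bw` and `d_width` are just illustrative:

```r
set.seed(1)
x <- rnorm(1000, 10, 2)

d_bw    <- density(x, bw = 1)     # bandwidth given as the Gaussian kernel's sd
d_width <- density(x, width = 4)  # S-style width, i.e. 4 * sd for the Gaussian kernel

d_width$bw                            # bandwidth density() ended up using
isTRUE(all.equal(d_bw$y, d_width$y))  # are the density values the same?
```

If the conversion works as described, the stored bandwidth should be width/4 and the two sets of density values should agree.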

However, because kde2d() was apparently written to remain compatible with S (and I suppose it was originally written for S, given it's in MASS), everything ends up divided by four. After flipping to the relevant section of MASS, the book (around p. 126), it seems they may have picked four to strike a balance between smoothness and fidelity to the data.

In conclusion, my guess is that kde2d() divides by four to remain consistent with the rest of MASS (and other things originally written for S), and that the way you're going about things looks fine.
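
As a sanity check on that reading, here is a minimal sketch that builds the product-Gaussian KDE by hand, using the standard deviations directly, and compares it with kde2d() called with four times those values. The names `s`, `kx`, `ky` and `z_manual` are my own, and the data are the ones from the question:

```r
library(MASS)

set.seed(1009)
x <- sample(seq(1000, 2000), 100, replace = TRUE)
y <- sample(seq(-12, 12), 100, replace = TRUE)

s  <- c(30, 1.5)                         # Gaussian kernel sds in x and y
gx <- seq(1000, 2000, length.out = 100)  # same grid that kde2d builds from lims
gy <- seq(-12, 12, length.out = 100)

# Kernel weight of every data point at every grid point, one matrix per axis
kx <- outer(gx, x, function(g, xi) dnorm(g, mean = xi, sd = s[1]))
ky <- outer(gy, y, function(g, yi) dnorm(g, mean = yi, sd = s[2]))

# Average over data points of the product of the two 1D kernels
z_manual <- tcrossprod(kx, ky) / length(x)

# kde2d on its own scale: bandwidth = 4 * sd
kk <- kde2d(x, y, h = 4 * s, n = 100, lims = c(1000, 2000, -12, 12))

all.equal(kk$z, z_manual)
```

If the two grids agree, that supports the view that kde2d's `h` is on the S "width" scale (4 times the Gaussian sd), so multiplying the Matlab bandwidths by 4 is the appropriate conversion here.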
