Simple example calculating Mahalanobis distance between two groups in R
I'm trying to reproduce this example, which uses Excel to calculate the Mahalanobis distance between two groups. To my mind, the example explains the concept well. However, I'm not able to reproduce its result in R.
The result obtained in the Excel example is Mahalanobis(g1, g2) = 1.4104.
Following the answer given here for R, I applied it to the data from the example as follows:
# dataset used in the Excel example
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3), ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)

# function adapted from the R example
D.sq <- function (g1, g2) {
  # difference between the group mean vectors
  dbar <- as.vector(colMeans(g1) - colMeans(g2))
  S1 <- cov(g1)
  S2 <- cov(g2)
  n1 <- nrow(g1)
  n2 <- nrow(g2)
  # pooled covariance matrix (unbiased, denominator n1 + n2 - 2)
  V <- as.matrix((1/(n1 + n2 - 2)) * (((n1 - 1) * S1) + ((n2 - 1) * S2)))
  # squared Mahalanobis distance between the two group means
  D.sq <- t(dbar) %*% solve(V) %*% dbar
  res <- list()
  res$D.sq <- D.sq
  res$V <- V
  res
}

D.sq(g1,g2)
Executing the function on the data returns the following output:
$D.sq
         [,1]
[1,] 1.724041

$V
          [,1]      [,2]
[1,] 3.5153846 0.3153846
[2,] 0.3153846 2.2230769
As far as I know, $D.sq represents the distance, and 1.724 differs noticeably from the 1.4104 result of the Excel example. As I'm new to the concept of the Mahalanobis distance, I was wondering whether I did something wrong and/or whether there's a better way to calculate this, e.g. using mahalanobis()?
You get a different result for two reasons.
First, the Excel example and the R code calculate the pooled covariance matrix differently: the R version gives the unbiased estimate of the covariance matrix, while the Excel version gives the maximum likelihood estimate (MLE). In the R version, the matrix is calculated as
((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)
while in the Excel version it is
((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2)
Second, the last calculation step in the Excel post you refer to is incorrect; the result should be 1.989278 instead.
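As a quick check of the two points above, the following sketch computes the squared distance for the question's data with both denominators. The helper name pooled_d_sq and its denom argument are hypothetical, introduced only for this illustration.

```r
# Data from the question
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3),
             ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)

# Hypothetical helper: squared Mahalanobis distance between two group means,
# with the pooled scatter matrix divided by a caller-supplied denominator
pooled_d_sq <- function(g1, g2, denom) {
  dbar <- colMeans(g1) - colMeans(g2)
  n1 <- nrow(g1)
  n2 <- nrow(g2)
  V <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / denom
  drop(t(dbar) %*% solve(V) %*% dbar)
}

n <- nrow(g1) + nrow(g2)
d_sq_unbiased <- pooled_d_sq(g1, g2, n - 2)  # about 1.724041, as in the question
d_sq_mle      <- pooled_d_sq(g1, g2, n)      # about 1.989278
sqrt(d_sq_mle)                               # about 1.4104
```

The square root of the second value is about 1.4104, the figure quoted from the Excel example, which suggests that figure is the distance D under the n1 + n2 denominator rather than the squared distance.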
Edit:
The unbiased estimator of the pooled covariance matrix is the standard choice, as on the Wikipedia page: https://en.wikipedia.org/wiki/Pooled_variance. A related fact is that in R, cov and var return the unbiased estimator of the covariance matrix rather than the MLE.
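The difference between the two estimators can be checked directly. This is a minimal sketch using the second group from the question; the variable names are made up for the illustration.

```r
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)
n <- nrow(g2)

# cov() divides the centered cross-products by n - 1 (unbiased estimator)
centered <- scale(g2, center = TRUE, scale = FALSE)
S_unbiased <- crossprod(centered) / (n - 1)
all.equal(S_unbiased, cov(g2), check.attributes = FALSE)

# The MLE divides by n instead, so it is a rescaled version of cov()
S_mle <- cov(g2) * (n - 1) / n
```

The same rescaling applies to var, since for a matrix argument var and cov return the same covariance matrix.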
Edit2: The mahalanobis function in R calculates the Mahalanobis distance from points to a distribution. It does not calculate the Mahalanobis distance between two samples.
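To illustrate, mahalanobis(x, center, cov) returns the squared distances of the rows of x from a given center and covariance. A minimal sketch with the question's data follows; passing one group mean as the point and a hand-computed pooled covariance V is a manual workaround to recover the two-group value, not something the function does on its own.

```r
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3),
             ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)

# Point-to-distribution use: squared distance of each row of g2
# from the distribution estimated from g1 (one value per row)
d_sq_points <- mahalanobis(g2, center = colMeans(g1), cov = cov(g1))

# Two-group workaround: treat one group mean as the point, the other as the
# center, and supply the pooled (unbiased) covariance computed by hand
n1 <- nrow(g1)
n2 <- nrow(g2)
V <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)
d_sq_groups <- mahalanobis(colMeans(g1), center = colMeans(g2), cov = V)
d_sq_groups  # about 1.724041, matching $D.sq from the question
```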
Conclusion: In sum, the most standard way to calculate the Mahalanobis distance between two samples is the R code in the original post, which uses the unbiased estimator of the pooled covariance matrix.