Calculate Mahalanobis distances when there are missing values
In R, I am trying to calculate Mahalanobis distances to check whether there are outliers in my data set, to test one of the assumptions for a MANOVA. I have missing values in my data set. I originally tried the mahalanobis function, but that didn't seem to work with missing values, so I tried the MDmiss function in the modi package. This worked for the cases where I had missing values in both of two variables (DO and Chla). However, if I was only missing data in Chla or DO, the distances were not calculated. Neither MDmiss nor mahalanobis returned distances for the rows without missing values.
I had also tried using the is.na and na.omit arguments in the original Mahalanobis distance function, but that didn't work either. I have included a sample data set. Appreciate the help. Thanks.
envdata <- data.frame(WaterTemp = c(56.7, 56.4, 60.8,60.6, 59.3, 57.5, 57.9, 65.8,59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA))
#Check for outliers using the Mahalanobis distance
#https://www.statology.org/mahalanobis-distance-r/
#Mahalanobis only works on numeric data. Make new data frame with only numeric variables
#Convert integers to numeric (dplyr provides %>%, mutate and select)
library(dplyr)
envdata <- envdata %>% mutate(SPC = as.numeric(SPC), DO = as.numeric(DO))
envdata_numeric <- envdata %>% dplyr::select(WaterTemp, SPC, Salinity, DO, Chla)
#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- mahalanobis(envdata_numeric, colMeans(envdata_numeric, na.rm = TRUE), cov(envdata_numeric))
#create new column in data frame to hold p-value for each Mahalanobis distance
envdata_numeric$p <- pchisq(envdata_numeric$mahal, df = 4, lower.tail = FALSE)
#Df = (c-1)
#DF = 5-1
envdata_numeric
#Error when calculating distances, possibly because of NA values. Try this other package: https://search.r-project.org/CRAN/refmans/modi/html/MDmiss.html
devtools::install_github("martinSter/modi")
library(modi)
#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- MDmiss(envdata_numeric, colMeans(envdata_numeric), cov(envdata_numeric))
There is a problem with the data you have shown: columns DO and Chla are collinear. Namely, you have only two complete observations (see rows 3 and 8 of envdata_numeric below):
envdata_numeric <- structure(list(WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5,
57.9, 65.8, 59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999,
47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92,
31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA,
NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358,
NA, NA, NA, 6.306, 26.84, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))
#    WaterTemp   SPC Salinity  DO   Chla
# 1       56.7 46600    30.28  NA  7.045
# 2       56.4 47520    30.92  NA     NA
# 3       60.8 47821    31.54  96  8.358
# 4       60.6 47801    31.34  NA     NA
# 5       59.3 47999    31.24  NA     NA
# 6       57.5 47418    30.87  NA     NA
# 7       57.9 47646    31.03  NA  6.306
# 8       65.8 49156    32.17 101 26.840
# 9       59.2 46350    30.12  99     NA
# 10      59.0 46260    30.05 103     NA
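A quick way to confirm this count is to list the complete rows directly. The sketch below rebuilds the sample data and uses base R's complete.cases; the names complete_rows and na_per_column are only illustrative.

```r
# Sample data from the question
envdata_numeric <- data.frame(
  WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 57.9, 65.8, 59.2, 59),
  SPC       = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260),
  Salinity  = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05),
  DO        = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103),
  Chla      = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA)
)

# Indices of the rows that have no NA in any column
complete_rows <- which(complete.cases(envdata_numeric))
complete_rows
# [1] 3 8

# Number of missing values in each column
na_per_column <- colSums(is.na(envdata_numeric))
na_per_column
```

Only two of the ten rows survive, which is far fewer than the five variables whose joint spread has to be estimated.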
Roughly speaking, you are trying to find outliers or calculate distances, but you do not have enough information to "draw the ellipsoid" around your cloud of points. Geometrically, this is what mahalanobis is doing. I sketched the situation below: the white circles are the columns without NA, and the big red circles indicate the two points that are located in higher dimensions (rows 3 and 8). There are infinitely many ellipsoids that can be drawn through 2 points and the center (I drew 2).
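The same point can be checked numerically, as a rough sketch: estimate the covariance from the complete rows only and look at its rank. With two complete observations the 5 x 5 covariance matrix cannot have full rank, so it has no inverse, and mahalanobis needs that inverse. The choice of use = "complete.obs" and the names S, S_rank and r_do_chla are assumptions made for this illustration.

```r
# Sample data from the question
envdata_numeric <- data.frame(
  WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 57.9, 65.8, 59.2, 59),
  SPC       = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260),
  Salinity  = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05),
  DO        = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103),
  Chla      = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA)
)

# Covariance estimated from the complete rows only (rows 3 and 8)
S <- cov(envdata_numeric, use = "complete.obs")

# Numerical rank of the estimate versus the number of variables
S_rank <- qr(S)$rank
c(rank = S_rank, variables = ncol(S))

# With only two complete points, DO and Chla lie exactly on a line,
# so their sample correlation is 1 (up to floating point error)
r_do_chla <- cor(envdata_numeric$DO, envdata_numeric$Chla, use = "complete.obs")
r_do_chla
```

A rank below the number of variables means the ellipsoid is flat in some directions, so distances along those directions are undefined.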
Anyway, if I add a data point to the DO column, e.g. 100 in row 1, and then proceed with imputation (I used the mice package), I can formally calculate the distances. As you will see, the p-values are all > 0.1. This means that although the algorithm runs, there is not enough data to judge outliers, even with 3 observations. Too many NAs.
library(mice)
envdata_numeric[1, "DO"] <- 100
envdata_numeric_imp <- complete(mice(envdata_numeric))
envdata_numeric_imp$maha <- mahalanobis(envdata_numeric_imp,
                                        colMeans(envdata_numeric_imp),
                                        cov(envdata_numeric_imp))
envdata_numeric_imp$p <- pchisq(envdata_numeric_imp$maha, df = 4,
                                lower.tail = FALSE)
envdata_numeric_imp
Output:
   WaterTemp   SPC Salinity  DO   Chla     maha         p
1       56.7 46600    30.28 100  7.045 1.274517 0.8656841
2       56.4 47520    30.92 103  7.045 3.554027 0.4697112
3       60.8 47821    31.54  96  8.358 7.201919 0.1255948
4       60.6 47801    31.34 103  6.306 3.968202 0.4103263
5       59.3 47999    31.24  96  6.306 5.790871 0.2153200
6       57.5 47418    30.87 101 26.840 6.985705 0.1366456
7       57.9 47646    31.03 101  6.306 1.523915 0.8223970
8       65.8 49156    32.17 101 26.840 7.254101 0.1230542
9       59.2 46350    30.12  99  6.306 3.556350 0.4693622
10      59.0 46260    30.05 103  7.045 3.890395 0.4210425
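To make the conclusion explicit, the distances printed above can be compared against a cutoff. This sketch copies the maha column from the output and recomputes the p-values with the same df = 4. The cutoff alpha = 0.001 is an assumed convention for flagging multivariate outliers, not something taken from the output, and the names alpha and flagged are illustrative.

```r
# Mahalanobis distances copied from the output above
maha <- c(1.274517, 3.554027, 7.201919, 3.968202, 5.790871,
          6.985705, 1.523915, 7.254101, 3.556350, 3.890395)

# Upper-tail chi-square probabilities, same df as in the code above
p <- pchisq(maha, df = 4, lower.tail = FALSE)

# Assumed cutoff for calling a row an outlier
alpha <- 0.001

# Row numbers whose p-value falls below the cutoff
flagged <- which(p < alpha)
flagged
# integer(0)

# Smallest p-value across all rows; it stays above 0.1
min(p)
```

No row comes close to the cutoff, which fits the point that the imputed data cannot say much about outliers.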