Calculating Mahalanobis distance with missing values

Question

In R, I am trying to calculate Mahalanobis distances to check whether there are outliers in my data set, in order to test one of the assumptions for a MANOVA. I have missing values in my data set. I originally tried the mahalanobis function, but that didn't seem to work with missing values, so I tried the MDmiss function in the modi package. This worked for the cases where I had missing values in both of two of my variables (DO and Chla). However, if I was only missing data in Chla or DO, the distances were not calculated. Neither the MDmiss nor the mahalanobis function returned distances when I lacked missing values.

I had also tried using is.na and na.omit as arguments to the original mahalanobis function, but that didn't work either. I have included a sample data set. I appreciate the help. Thanks.

envdata <- data.frame(WaterTemp = c(56.7, 56.4, 60.8,60.6, 59.3, 57.5, 57.9, 65.8,59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA))


#Check for outliers using the Mahalanobis distance
#https://www.statology.org/mahalanobis-distance-r/

#Mahalanobis only works on numeric data. Make new data frame with only numeric variables 
#Convert integers to numeric
library(dplyr)  # provides %>%, mutate() and select()
envdata <- envdata %>% mutate(SPC = as.numeric(SPC), DO = as.numeric(DO))
envdata_numeric <- envdata %>% dplyr::select(WaterTemp, SPC, Salinity, DO, Chla)

#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- mahalanobis(envdata_numeric, colMeans(envdata_numeric, na.rm = TRUE), cov(envdata_numeric))

#create new column in data frame to hold p-value for each Mahalanobis distance
envdata_numeric$p <- pchisq(envdata_numeric$mahal, df = 4, lower.tail = FALSE)
#Df = (c-1)
#DF = 5-1

envdata_numeric

#***#error with calculating distances. Possibly because of NA values. Try this other package. https://search.r-project.org/CRAN/refmans/modi/html/MDmiss.html
devtools::install_github("martinSter/modi")
library(modi)

#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- MDmiss(envdata_numeric, colMeans(envdata_numeric), cov(envdata_numeric))

Answer 1

There is a problem with the data you have shown: columns DO and Chla are collinear. Namely, you have only two complete observations (see Rows 3 and 8 of envdata_numeric below):

envdata_numeric <- structure(list(WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 
57.9, 65.8, 59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999, 
47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92, 
31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA, 
NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358, 
NA, NA, NA, 6.306, 26.84, NA, NA)), class = "data.frame", row.names = c(NA, 
-10L))

#    WaterTemp   SPC Salinity  DO   Chla
# 1       56.7 46600    30.28  NA  7.045
# 2       56.4 47520    30.92  NA     NA
# 3       60.8 47821    31.54  96  8.358
# 4       60.6 47801    31.34  NA     NA
# 5       59.3 47999    31.24  NA     NA
# 6       57.5 47418    30.87  NA     NA
# 7       57.9 47646    31.03  NA  6.306
# 8       65.8 49156    32.17 101 26.840
# 9       59.2 46350    30.12  99     NA
# 10      59.0 46260    30.05 103     NA
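
The point about complete observations can be checked directly. The following is a minimal sketch in base R, not part of the original answer; the object names S_default, n_complete and S_complete are introduced here only for illustration. It shows that cov() returns NA entries by default when the data contain NA, and that the covariance estimated from the two complete rows is rank-deficient, so it cannot be inverted for a Mahalanobis distance.

```r
# Same sample data as in the question
envdata_numeric <- data.frame(
  WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 57.9, 65.8, 59.2, 59),
  SPC       = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260),
  Salinity  = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05),
  DO        = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103),
  Chla      = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA)
)

# By default cov() propagates NA, so entries involving DO or Chla are NA
S_default <- cov(envdata_numeric)
anyNA(S_default)

# Number of rows with no missing value in any column (Rows 3 and 8)
n_complete <- sum(complete.cases(envdata_numeric))
n_complete

# Covariance estimated from the complete rows only; two points span a
# single direction, so the 5 x 5 matrix cannot have full rank
S_complete <- cov(envdata_numeric, use = "complete.obs")
qr(S_complete)$rank < ncol(envdata_numeric)
```

Because the matrix is not of full rank, passing it to solve() or mahalanobis() is not expected to give usable distances.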

Roughly speaking, you are trying to find outliers or calculate distances, but you do not have enough information to "draw the ellipsoid" around the cloud of your points. This is what mahalanobis does geometrically. I sketched the situation below: the white circles are the columns without NA, and the big red ones indicate the two points that are located in higher dimensions (Rows 3 and 8). There are infinitely many ellipsoids that can be drawn through 2 points and the center (I drew 2).
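
To make the geometric description concrete: the squared Mahalanobis distance of a point x from a center mu is (x - mu)' S^-1 (x - mu), where S is the covariance matrix that defines the shape of the ellipsoid. The following is a minimal sketch, not part of the original answer, that restricts the data to the three columns without NA (an assumption made here only so that S can be estimated) and checks a hand-written version of the formula against the built-in mahalanobis(). The names X, mu, S, centered, d2_manual and d2_builtin are illustrative.

```r
# Only the three columns of the sample data that contain no NA
X <- cbind(
  WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 57.9, 65.8, 59.2, 59),
  SPC       = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260),
  Salinity  = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05)
)

mu <- colMeans(X)  # center of the point cloud
S  <- cov(X)       # shape of the ellipsoid

# Squared distance by hand: (x - mu)' S^-1 (x - mu) for every row
centered  <- sweep(X, 2, mu)
d2_manual <- rowSums((centered %*% solve(S)) * centered)

# Built-in version; note that mahalanobis() returns squared distances
d2_builtin <- mahalanobis(X, mu, S)

all.equal(unname(d2_manual), unname(d2_builtin))
```

This only covers the fully observed columns; it says nothing about outliers in DO or Chla, which is where the missing values are.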

Anyway, if I add a data point to the DO column, e.g. 100 in Row 1, and then proceed with imputation (I used the mice package), I can formally calculate the distances. As you will see, the p-values will all be > 0.1. This means that although the algorithm runs, even 3 complete observations are not enough to judge outliers. There are too many NAs.

library(mice)
envdata_numeric[1, "DO"] <- 100
envdata_numeric_imp <- complete(mice(envdata_numeric))
envdata_numeric_imp$maha <- mahalanobis(envdata_numeric_imp, 
                                        colMeans(envdata_numeric_imp), 
                                        cov(envdata_numeric_imp))

envdata_numeric_imp$p = pchisq(envdata_numeric_imp$maha, df = 4, 
                               lower.tail = FALSE)


envdata_numeric_imp

Output:

   WaterTemp   SPC Salinity  DO   Chla     maha         p
1       56.7 46600    30.28 100  7.045 1.274517 0.8656841
2       56.4 47520    30.92 103  7.045 3.554027 0.4697112
3       60.8 47821    31.54  96  8.358 7.201919 0.1255948
4       60.6 47801    31.34 103  6.306 3.968202 0.4103263
5       59.3 47999    31.24  96  6.306 5.790871 0.2153200
6       57.5 47418    30.87 101 26.840 6.985705 0.1366456
7       57.9 47646    31.03 101  6.306 1.523915 0.8223970
8       65.8 49156    32.17 101 26.840 7.254101 0.1230542
9       59.2 46350    30.12  99  6.306 3.556350 0.4693622
10      59.0 46260    30.05 103  7.045 3.890395 0.4210425

Solution 1
0 2022-12-14 22:08:23
