我如何找到来自 MASS 的 LDA function 指定观察属于哪个 class 的分数？

Question

I have a dataset of body measurements for birds and I'm using the lda function from the MASS package to find out the extent of sexual dimorphism.我有一个鸟类身体测量数据集，我正在使用 MASS package 中的 lda function 来找出两性异形的程度。 Eventually, I want to end up with an equation and critical score that can be used in the field (without access to computers or R) to determine if the bird in hand is male or female.最终，我想得出一个方程式和临界分数，可以在现场使用（无需使用计算机或 R）来确定手中的鸟是雄性还是雌性。 In our data set, there are more males than females.在我们的数据集中，男性多于女性。 I don't know exactly why that is, but for now, I'm assuming this means there is a real reason why males are captured more often than females, though our dataset is only 34 birds so this might not be significant.我不知道为什么会这样，但就目前而言，我假设这意味着雄性比雌性更容易被捕获是有真正原因的，尽管我们的数据集只有 34 只鸟，所以这可能并不重要。

I know how to extract/determine the equation (following the instructions halfway down the page here: https://stats.stackexchange.com/questions/157772/how-to-find-the-line ) but there is some overlap in the D-scores where the predict.lda function seems to go either way.我知道如何提取/确定方程式（按照此处页面中间的说明进行操作： https://stats.stackexchange.com/questions/157772/how-to-find-the-line ）但是有一些重叠无论哪种方式，预测.lda function 似乎为 go 的 D 分数。 I expected the critical D-score to be 0 but it's not...我预计临界 D 分数为 0，但它不是......

I would like to know how I can find 1) the D-score where the model will always determine the bird is female (or male), 2) what the extent of the overlap is.我想知道如何找到 1) D 分数，其中 model 将始终确定这只鸟是雌性（或雄性），2）重叠的程度是多少。

Mock code (with the real data there is more overlap):模拟代码（与真实数据有更多重叠）：

set.seed(42) 

train <- data.frame(sex = c(rep("F", 35), rep("M", 65)),
                   A = c(rnorm(35, 20, 2.5), rnorm(65, 15, 2.5)),
                   B = c(rnorm(35, 6, 0.2), rnorm(65, 5.5, 0.2)),
                   C = c(rnorm(35, 250, 5), rnorm(65, 240, 5)),
                   D = c(rnorm(35, 450, 25), rnorm(65, 350, 25)))

mod <- lda(sex ~ ., data = train)
mod

gm = mod$prior %*% mod$means # these are used to get the equation
const = drop(gm %*% mod$scaling)

#the equation is then: D = mod$scaling[1] * A + mod$scaling[2] * B + mod$scaling[3] * C + mod$scaling[4] * D - const

test <- data.frame(sex = c(rep("F", 350), rep("M", 650)),
                  A = rnorm(1000, gm[1], 2.5),
                  B = rnorm(1000, gm[2], 0.2),
                  C = rnorm(1000, gm[3], 5),
                  D = rnorm(1000, gm[4], 25))

pred <- data.frame(predict(mod, test)$x, class = predict(mod, test)$class)

I've Googled a lot and looked at several stack exchange and stack overflow questions, but I can't figure it out.我在谷歌上搜索了很多，并查看了几个堆栈交换和堆栈溢出问题，但我无法弄清楚。

Answer 1

For your example data the quantiles for male and female:对于您的示例数据，男性和女性的分位数：

by(D, train$sex, quantile)
# train$sex: F
#        0%       25%       50%       75%      100% 
# -6.271599 -4.489364 -3.770150 -3.017528 -1.327032 
# ----------------------------------------------------------------------------
# train$sex: M
#         0%        25%        50%        75%       100% 
# -0.8563099  1.5266578  1.9219727  2.7991112  3.8717447

There is no overlap for this example.此示例没有重叠。 D values less than -1.327 are always female and values greater than -.856 are always male.小于 -1.327 的 D 值始终为女性，大于 -.856 的值始终为男性。 If the ranges overlap, then you will have to decide whether to flip a coin or record them as uncertain.如果范围重叠，那么您将不得不决定是抛硬币还是将它们记录为不确定。

You can get a more detailed view by looking at the posterior probabilities:您可以通过查看后验概率获得更详细的视图：

pred.tr <- as.data.frame(predict(mod))
idx <- order(pred.tr$LD1)
pred.srt <- pred.tr[idx, ]
pred.srt
#     class  posterior.F  posterior.M        LD1
# 4       F 1.000000e+00 3.895671e-14 -6.2715995
# 25      F 1.000000e+00 7.087004e-14 -6.1690763
# 35      F 1.000000e+00 5.234647e-12 -5.4319799
# 2       F 1.000000e+00 9.615516e-11 -4.9332964
# 18      F 1.000000e+00 1.017526e-10 -4.9236025
#  . . . .
# 13      F 9.996574e-01 3.426315e-04 -2.3485213
# 28      F 9.996073e-01 3.926946e-04 -2.3251473
# 19      F 8.825072e-01 1.174928e-01 -1.3270319 # <- Last female
# 81      M 3.249597e-01 6.750403e-01 -0.8563099 # <- First male
# 80      M 2.324926e-04 9.997675e-01  0.4518529
# 46      M 2.247020e-04 9.997753e-01  0.4576938
# . . . .
# 36      M 1.282832e-11 1.000000e+00  3.3152791
# 39      M 2.153913e-12 1.000000e+00  3.6209947
# 52      M 1.169887e-12 1.000000e+00  3.7255708
# 82      M 8.625676e-13 1.000000e+00  3.7777833
# 59      M 4.984432e-13 1.000000e+00  3.8717447

You could also use the test data instead of the training data, to see if the boundary between male and female is fuzzier than the training data suggest.您还可以使用测试数据而不是训练数据，看看男性和女性之间的界限是否比训练数据显示的更模糊。 The posterior probabilities indicate that for LD1 values less than -1.327 the probability of being female is essentially 100%.后验概率表明，对于小于 -1.327 的 LD1 值，女性的概率基本上为 100%。 For values of -.856 the probability of being male is 67.5% and by.452 and above it is essentially 100%.对于 -.856 的值，男性的概率为 67.5%，对于 .452 及以上，则基本上为 100%。

我如何找到来自 MASS 的 LDA function 指定观察属于哪个 class 的分数？

问题描述

1 个解决方案

解决方案1
1 已采纳 2022-11-23 04:30:33

我如何找到来自 MASS 的 LDA function 指定观察属于哪个 class 的分数？

问题描述

1 个解决方案

解决方案1 1 已采纳 2022-11-23 04:30:33

解决方案1
1 已采纳 2022-11-23 04:30:33