
How do I find the scores at which the LDA function from MASS specifies to which class an observation belongs?

I have a dataset of body measurements for birds and I'm using the lda function from the MASS package to find out the extent of sexual dimorphism. Eventually, I want to end up with an equation and critical score that can be used in the field (without access to computers or R) to determine if the bird in hand is male or female. In our data set, there are more males than females. I don't know exactly why that is, but for now I'm assuming there is a real reason why males are captured more often than females, though our dataset is only 34 birds so the difference might not be significant.

I know how to extract/determine the equation (following the instructions halfway down the page here: https://stats.stackexchange.com/questions/157772/how-to-find-the-line ) but there is some overlap in the D-scores where the predict.lda function seems to go either way. I expected the critical D-score to be 0 but it's not...

I would like to know how I can find 1) the D-score beyond which the model will always determine the bird is female (or male), and 2) what the extent of the overlap is.

Mock code (with the real data there is more overlap):

library(MASS)  # provides lda() and predict.lda()
set.seed(42)

train <- data.frame(sex = c(rep("F", 35), rep("M", 65)),
                   A = c(rnorm(35, 20, 2.5), rnorm(65, 15, 2.5)),
                   B = c(rnorm(35, 6, 0.2), rnorm(65, 5.5, 0.2)),
                   C = c(rnorm(35, 250, 5), rnorm(65, 240, 5)),
                   D = c(rnorm(35, 450, 25), rnorm(65, 350, 25)))

mod <- lda(sex ~ ., data = train)
mod

gm = mod$prior %*% mod$means # these are used to get the equation
const = drop(gm %*% mod$scaling)

#the equation is then: D = mod$scaling[1] * A + mod$scaling[2] * B + mod$scaling[3] * C + mod$scaling[4] * D - const

test <- data.frame(sex = c(rep("F", 350), rep("M", 650)),
                  A = rnorm(1000, gm[1], 2.5),
                  B = rnorm(1000, gm[2], 0.2),
                  C = rnorm(1000, gm[3], 5),
                  D = rnorm(1000, gm[4], 25))

pred <- data.frame(predict(mod, test)$x, class = predict(mod, test)$class)
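
As a quick sanity check on the equation in the comment above, the hand-computed score can be compared with what predict() returns for the training data. This is only a sketch under the same mock data; the name D_score is made up here to avoid clashing with the column D, and it assumes the predictor columns are in the same order as the rows of mod$scaling.

```r
library(MASS)
set.seed(42)

train <- data.frame(sex = c(rep("F", 35), rep("M", 65)),
                    A = c(rnorm(35, 20, 2.5), rnorm(65, 15, 2.5)),
                    B = c(rnorm(35, 6, 0.2), rnorm(65, 5.5, 0.2)),
                    C = c(rnorm(35, 250, 5), rnorm(65, 240, 5)),
                    D = c(rnorm(35, 450, 25), rnorm(65, 350, 25)))
mod <- lda(sex ~ ., data = train)

# Prior-weighted grand mean of the predictors and the resulting constant
gm <- drop(mod$prior %*% mod$means)
const <- drop(gm %*% mod$scaling)

# Score by hand: coefficients times measurements, minus the constant
X <- as.matrix(train[, c("A", "B", "C", "D")])
D_score <- drop(X %*% mod$scaling) - const

# Compare with the LD1 scores computed by predict.lda
same <- all.equal(unname(D_score), unname(predict(mod)$x[, "LD1"]))
print(same)
```

If the two agree, the four coefficients and the constant are all that is needed to compute a score with a pocket calculator.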


I've Googled a lot and looked at several Stack Exchange and Stack Overflow questions, but I can't figure it out.

For your example data, the quantiles of the discriminant scores for males and females are:

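# Assumption: D holds the discriminant scores (LD1) of the training data
D <- predict(mod)$x[, "LD1"]
by(D, train$sex, quantile)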
# train$sex: F
#        0%       25%       50%       75%      100% 
# -6.271599 -4.489364 -3.770150 -3.017528 -1.327032 
# ----------------------------------------------------------------------------
# train$sex: M
#         0%        25%        50%        75%       100% 
# -0.8563099  1.5266578  1.9219727  2.7991112  3.8717447 

There is no overlap for this example. D values less than -1.327 are always female and values greater than -0.856 are always male. If the ranges overlap, then you will have to decide whether to flip a coin or record them as uncertain.
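
To put a number on the extent of the overlap, one option is to compare the score ranges of the two groups directly. The helper below is a hypothetical sketch (overlap_zone is not part of MASS), shown on toy scores so the result is easy to check by hand.

```r
# Hypothetical helper: where do the score ranges of two groups overlap?
overlap_zone <- function(scores, group) {
  r <- sapply(split(scores, group), range)  # row 1 = minima, row 2 = maxima
  lo <- max(r[1, ])  # largest of the group minima
  hi <- min(r[2, ])  # smallest of the group maxima
  inside <- scores >= lo & scores <= hi
  list(overlaps = lo <= hi,  # TRUE if the ranges share any values
       lower = lo,
       upper = hi,
       n_inside = if (lo <= hi) sum(inside) else 0L)
}

# Toy scores: the "F" range is 1 to 3, the "M" range is 2.5 to 5
toy_scores <- c(1, 2, 3, 2.5, 4, 5)
toy_group <- c("F", "F", "F", "M", "M", "M")
zone <- overlap_zone(toy_scores, toy_group)
str(zone)
```

With a fitted model this could be called as overlap_zone(predict(mod)$x[, "LD1"], train$sex). When overlaps is FALSE, lower is greater than upper and the two values bracket the gap between the groups instead.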

You can get a more detailed view by looking at the posterior probabilities:

pred.tr <- as.data.frame(predict(mod))
idx <- order(pred.tr$LD1)
pred.srt <- pred.tr[idx, ]
pred.srt
#     class  posterior.F  posterior.M        LD1
# 4       F 1.000000e+00 3.895671e-14 -6.2715995
# 25      F 1.000000e+00 7.087004e-14 -6.1690763
# 35      F 1.000000e+00 5.234647e-12 -5.4319799
# 2       F 1.000000e+00 9.615516e-11 -4.9332964
# 18      F 1.000000e+00 1.017526e-10 -4.9236025
#  . . . .
# 13      F 9.996574e-01 3.426315e-04 -2.3485213
# 28      F 9.996073e-01 3.926946e-04 -2.3251473
# 19      F 8.825072e-01 1.174928e-01 -1.3270319 # <- Last female
# 81      M 3.249597e-01 6.750403e-01 -0.8563099 # <- First male
# 80      M 2.324926e-04 9.997675e-01  0.4518529
# 46      M 2.247020e-04 9.997753e-01  0.4576938
# . . . .
# 36      M 1.282832e-11 1.000000e+00  3.3152791
# 39      M 2.153913e-12 1.000000e+00  3.6209947
# 52      M 1.169887e-12 1.000000e+00  3.7255708
# 82      M 8.625676e-13 1.000000e+00  3.7777833
# 59      M 4.984432e-13 1.000000e+00  3.8717447

You could also use the test data instead of the training data, to see if the boundary between male and female is fuzzier than the training data suggest. The posterior probabilities indicate that for LD1 values less than -1.327 the probability of being female is at least 88%, and it approaches 100% as the score decreases. At a value of -0.856 the probability of being male is 67.5%, and by 0.452 and above it is essentially 100%.
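
The score at which the predicted class flips can also be computed directly. The sketch below assumes two classes and the default plug-in method of predict.lda, which treats the within-group variance of LD1 as 1. Setting the two posteriors equal then gives a cutoff at the midpoint of the group means on the LD1 scale, shifted by log(prior_F / prior_M) / (mean_M - mean_F). That shift is why the critical score is not 0 when the priors are unequal. The names m, cutoff and rule_class are made up for this illustration.

```r
library(MASS)
set.seed(42)

train <- data.frame(sex = c(rep("F", 35), rep("M", 65)),
                    A = c(rnorm(35, 20, 2.5), rnorm(65, 15, 2.5)),
                    B = c(rnorm(35, 6, 0.2), rnorm(65, 5.5, 0.2)),
                    C = c(rnorm(35, 250, 5), rnorm(65, 240, 5)),
                    D = c(rnorm(35, 450, 25), rnorm(65, 350, 25)))
mod <- lda(sex ~ ., data = train)

pr <- predict(mod)
ld1 <- pr$x[, "LD1"]

# Group means on the LD1 scale, centered on the prior-weighted grand mean
gm <- drop(mod$prior %*% mod$means)
m <- drop(scale(mod$means, center = gm, scale = FALSE) %*% mod$scaling)

# Score at which both posterior probabilities equal 0.5
cutoff <- unname((m["F"] + m["M"]) / 2 +
                 log(mod$prior["F"] / mod$prior["M"]) / (m["M"] - m["F"]))
print(cutoff)

# Field rule: scores on the male side of the cutoff are called male
rule_class <- ifelse((ld1 - cutoff) * (m[["M"]] - m[["F"]]) > 0, "M", "F")
print(all(rule_class == as.character(pr$class)))
```

A cutoff like this reproduces the model's class assignments, but it does not remove the uncertainty close to it: scores near the cutoff still have posterior probabilities near 50%.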


