
R: plotting posterior classification probabilities of a linear discriminant analysis in ggplot2

Using ggord one can make nice linear discriminant analysis ggplot2 biplots (cf. chapter 11, Fig. 11.5 in "Biplots in Practice" by M. Greenacre), as in:

library(MASS)
install.packages("devtools")
library(devtools)
install_github("fawda123/ggord")
library(ggord)
data(iris)
ord <- lda(Species ~ ., iris, prior = rep(1, 3)/3)
ggord(ord, iris$Species)

[image]

I would also like to add the classification regions (shown as solid regions in the same colour as their respective group, with, say, alpha = 0.5) or the posterior probabilities of class membership (with alpha varying according to the posterior probability, in the same colour as used for each group). This can be done in BiplotGUI, but I am looking for a ggplot2 solution. Would anyone know how to do this with ggplot2, perhaps using geom_tile?

EDIT: below someone asks how to calculate the posterior classification probabilities and predicted classes. It goes like this:

library(MASS)
library(ggplot2)
library(scales)
fit <- lda(Species ~ ., data = iris, prior = rep(1, 3)/3)
datPred <- data.frame(Species=predict(fit)$class,predict(fit)$x)
#Create decision boundaries
fit2 <- lda(Species ~ LD1 + LD2, data=datPred, prior = rep(1, 3)/3)
ld1lim <- expand_range(c(min(datPred$LD1),max(datPred$LD1)),mul=0.05)
ld2lim <- expand_range(c(min(datPred$LD2),max(datPred$LD2)),mul=0.05)
ld1 <- seq(ld1lim[[1]], ld1lim[[2]], length.out=300)
ld2 <- seq(ld2lim[[1]], ld2lim[[2]], length.out=300)
newdat <- expand.grid(list(LD1=ld1,LD2=ld2))
preds <-predict(fit2,newdata=newdat)
predclass <- preds$class
postprob <- preds$posterior
df <- data.frame(x=newdat$LD1, y=newdat$LD2, class=predclass)
df$classnum <- as.numeric(df$class)
df <- cbind(df,postprob)
head(df)

           x        y     class classnum       setosa   versicolor virginica
1 -10.122541 -2.91246 virginica        3 5.417906e-66 1.805470e-10         1
2 -10.052563 -2.91246 virginica        3 1.428691e-65 2.418658e-10         1
3  -9.982585 -2.91246 virginica        3 3.767428e-65 3.240102e-10         1
4  -9.912606 -2.91246 virginica        3 9.934630e-65 4.340531e-10         1
5  -9.842628 -2.91246 virginica        3 2.619741e-64 5.814697e-10         1
6  -9.772650 -2.91246 virginica        3 6.908204e-64 7.789531e-10         1

# Default ggplot2 colours: n evenly spaced hues at fixed luminance l and chroma c
colorfun <- function(n, l = 65, c = 100) {
  hues <- seq(15, 375, length.out = n + 1)
  hcl(h = hues, l = l, c = c)[1:n]
}
colors <- colorfun(3)
colorslight <- colorfun(3, l = 90, c = 50)
# show.legend and guide = "none" replace the deprecated show_guide and guide = F
ggplot(datPred, aes(x = LD1, y = LD2)) +
    geom_raster(data = df, aes(x = x, y = y, fill = factor(class)), alpha = 0.7, show.legend = FALSE) +
    geom_contour(data = df, aes(x = x, y = y, z = classnum), colour = "red2", alpha = 0.5, breaks = c(1.5, 2.5)) +
    geom_point(data = datPred, size = 3, aes(pch = Species, colour = Species)) +
    scale_x_continuous(limits = ld1lim, expand = c(0, 0)) +
    scale_y_continuous(limits = ld2lim, expand = c(0, 0)) +
    scale_fill_manual(values = colorslight, guide = "none")

[image]

(I am not sure that showing the classification boundaries with contours at breaks 1.5 and 2.5 is always correct. It is correct for the boundaries between species 1 and 2 and between species 2 and 3, but not if the region of species 1 were adjacent to that of species 3, as I would then get two boundaries there. Maybe I would have to use the approach used here, where the boundary between each pair of species is considered separately.)
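One way around the adjacency problem, sketched here under assumptions rather than taken from the linked approach, is to contour a separate 0/1 indicator surface per class at 0.5 instead of contouring the numeric class code. The names `grid`, `ind`, `target` and `inside` are illustrative, and the one-unit padding of the grid is arbitrary.

```r
library(MASS)
library(ggplot2)

# Same two-step fit as above: LDA on all variables, then on the LD1/LD2 scores
fit <- lda(Species ~ ., data = iris, prior = rep(1, 3) / 3)
datPred <- data.frame(Species = predict(fit)$class, predict(fit)$x)
fit2 <- lda(Species ~ LD1 + LD2, data = datPred, prior = rep(1, 3) / 3)

# Regular grid over the discriminant plane (padding of 1 unit is an assumption)
grid <- expand.grid(
  LD1 = seq(min(datPred$LD1) - 1, max(datPred$LD1) + 1, length.out = 200),
  LD2 = seq(min(datPred$LD2) - 1, max(datPred$LD2) + 1, length.out = 200)
)
grid$class <- predict(fit2, newdata = grid)$class

# Long format: for each class k, a surface that is 1 inside its region and 0 outside
ind <- do.call(rbind, lapply(levels(grid$class), function(k) {
  data.frame(LD1 = grid$LD1, LD2 = grid$LD2, target = k,
             inside = as.numeric(grid$class == k))
}))

# Contouring each indicator at 0.5 traces the outline of that class region,
# whichever classes happen to be its neighbours
p <- ggplot(datPred, aes(x = LD1, y = LD2)) +
  geom_raster(data = grid, aes(fill = class), alpha = 0.4) +
  geom_contour(data = ind, aes(z = inside, group = target),
               breaks = 0.5, colour = "red2") +
  geom_point(aes(shape = Species, colour = Species), size = 2)
```

Printing `p` draws the plot. Each shared boundary is traced twice (once per neighbouring class), which is harmless visually, and the loop over `levels(grid$class)` does not depend on the number of groups.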

This gets me as far as plotting the classification regions. I am still looking for a way to also plot the actual posterior classification probabilities for each species at each coordinate, using a species-specific colour and an alpha (opacity) proportional to the posterior classification probability of that species. In other words, a stack of three superimposed images. As alpha blending in ggplot2 is known to be order-dependent, I think the colours of this stack would have to be calculated beforehand and plotted using something like

qplot(x, y, data=mydata, fill=rgb, geom="raster") + scale_fill_identity() 
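A minimal sketch of precomputing such colours, assuming a plain posterior-weighted average of the class colours in RGB space (one possible blending rule among several, and not necessarily what the SAS output uses). The names `grid`, `post`, `base_rgb`, `mix` and `blend` are illustrative.

```r
library(MASS)
library(ggplot2)

# Same two-step fit as above
fit <- lda(Species ~ ., data = iris, prior = rep(1, 3) / 3)
datPred <- data.frame(Species = predict(fit)$class, predict(fit)$x)
fit2 <- lda(Species ~ LD1 + LD2, data = datPred, prior = rep(1, 3) / 3)

# Regular grid over the discriminant plane (padding of 1 unit is an assumption)
grid <- expand.grid(
  LD1 = seq(min(datPred$LD1) - 1, max(datPred$LD1) + 1, length.out = 200),
  LD2 = seq(min(datPred$LD2) - 1, max(datPred$LD2) + 1, length.out = 200)
)
post <- predict(fit2, newdata = grid)$posterior  # one row per cell, one column per class

# One base colour per class, using the default ggplot2 hue spacing
n_class <- ncol(post)
hues <- seq(15, 375, length.out = n_class + 1)[seq_len(n_class)]
class_cols <- hcl(h = hues, c = 100, l = 65)
base_rgb <- t(col2rgb(class_cols)) / 255         # n_class x 3 matrix, values in [0, 1]

# Posterior-weighted average of the class colours; independent of any layer order
mix <- post %*% base_rgb
mix <- pmin(pmax(mix, 0), 1)                      # guard against rounding outside [0, 1]
grid$blend <- rgb(mix[, 1], mix[, 2], mix[, 3])

p <- ggplot(grid, aes(x = LD1, y = LD2)) +
  geom_raster(aes(fill = blend)) +
  scale_fill_identity() +
  geom_point(data = datPred, aes(shape = Species), size = 2)
```

Printing `p` draws the plot. Because the weights in each row of `post` sum to 1, the result is a single colour per cell and works for any number of groups, although averaging in RGB tends to give muddy tones where several classes are about equally likely.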

Here is a SAS example of what I am after:

[image]

Would anyone know how to do this? Or does anyone have thoughts on how best to represent these posterior classification probabilities?

Note that the method should work for any number of groups, not just for this specific example.

I suppose the easiest way is to show the posterior probabilities on the points themselves. It is pretty straightforward in your case:

# Highest posterior probability per observation, mapped to point alpha below
datPred$maxProb <- apply(predict(fit)$posterior, 1, max)
ggplot(datPred, aes(x = LD1, y = LD2)) +
  geom_raster(data = df, aes(x = x, y = y, fill = factor(class)), alpha = 0.7, show.legend = FALSE) +
  geom_contour(data = df, aes(x = x, y = y, z = classnum), colour = "red2", alpha = 0.5, breaks = c(1.5, 2.5)) +
  geom_point(data = datPred, size = 3, aes(pch = Species, colour = Species, alpha = maxProb)) +
  scale_x_continuous(limits = ld1lim, expand = c(0, 0)) +
  scale_y_continuous(limits = ld2lim, expand = c(0, 0)) +
  scale_fill_manual(values = colorslight, guide = "none")

[image]

You can see the points fade into the background at the blue-green border.
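The same idea can be carried over from the points to the background: a sketch, under the assumption that showing only the winning class per cell is acceptable, in which the raster's fill is the predicted class and its alpha is that class's posterior probability. The names `grid`, `pr` and `maxProb` and the alpha range are illustrative choices.

```r
library(MASS)
library(ggplot2)

# Same two-step fit as in the question
fit <- lda(Species ~ ., data = iris, prior = rep(1, 3) / 3)
datPred <- data.frame(Species = predict(fit)$class, predict(fit)$x)
fit2 <- lda(Species ~ LD1 + LD2, data = datPred, prior = rep(1, 3) / 3)

# Regular grid over the discriminant plane (padding of 1 unit is an assumption)
grid <- expand.grid(
  LD1 = seq(min(datPred$LD1) - 1, max(datPred$LD1) + 1, length.out = 200),
  LD2 = seq(min(datPred$LD2) - 1, max(datPred$LD2) + 1, length.out = 200)
)
pr <- predict(fit2, newdata = grid)
grid$class <- pr$class
grid$maxProb <- apply(pr$posterior, 1, max)  # posterior of the winning class per cell

# A single raster layer, so there is no layer order to worry about
p <- ggplot(grid, aes(x = LD1, y = LD2)) +
  geom_raster(aes(fill = class, alpha = maxProb)) +
  scale_alpha_continuous(range = c(0.1, 0.8)) +
  geom_point(data = datPred, aes(shape = Species, colour = Species), size = 2)
```

Printing `p` draws the plot. Regions fade towards the background where the winning class is uncertain, and nothing in the code depends on there being three groups, but the probabilities of the losing classes are not shown.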

I also just came up with the following easy solution: add a column to df in which the class of each grid cell is drawn at random according to the posterior probabilities, which results in dithering in the uncertain regions, e.g. as in

fit = lda(Species ~ Sepal.Length + Sepal.Width, data = iris, prior = rep(1, 3)/3)
ld1lim <- expand_range(c(min(datPred$LD1),max(datPred$LD1)),mul=0.5)
ld2lim <- expand_range(c(min(datPred$LD2),max(datPred$LD2)),mul=0.5)

with the rest as above, and inserting

lvls <- unique(df$class)
# Draw one class per grid cell at random, weighted by that cell's posterior probabilities
df$classpprob <- apply(df[, as.character(lvls)], 1, function(row) sample(lvls, 1, prob = row))

# Species and ld1lim/ld2lim are the names defined above (the original used Group and rngs)
p <- ggplot(datPred, aes(x = LD1, y = LD2)) +
  geom_raster(data = df, aes(x = x, y = y, fill = factor(classpprob)), alpha = 0.7, show.legend = FALSE) +
  geom_point(data = datPred, size = 3, aes(pch = Species, colour = Species)) +
  scale_fill_manual(values = colorslight, guide = "none") +
  scale_x_continuous(limits = ld1lim, expand = c(0, 0)) +
  scale_y_continuous(limits = ld2lim, expand = c(0, 0))
p

which gives me:

[image]

This is a lot easier and clearer than mixing colours in some additive or subtractive fashion (which is the part I still had trouble with, and which apparently is not so trivial to do well).
