
Predicting from a flexmix object (R)

I fit some data to a mixture of two Gaussians in flexmix:

library(flexmix)
data("NPreg", package = "flexmix")
mod <- flexmix(yn ~ x, data = NPreg, k = 2,
           model = list(FLXMRglm(yn ~ x, family= "gaussian"),
                        FLXMRglm(yn ~ x, family = "gaussian")))

The model fit is as follows:

> mod

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2, model =    list(FLXMRglm(yn ~ x, family = "gaussian"), 
    FLXMRglm(yn ~ x, family = "gaussian")))

Cluster sizes:
  1   2 
 74 126 

convergence after 31 iterations

But how do I actually predict from this model?

When I do

pred <- predict(mod, NPreg)

I get a list with the predictions from each of the two components.

To get a single prediction, do I have to weight the components by the cluster sizes like this?

single <- (74/200)* pred$Comp.1[,1] + (126/200)*pred$Comp.2[,2]

I use flexmix for prediction in the following way:

pred = predict(mod, NPreg)
clust = clusters(mod,NPreg)
result = cbind(NPreg,data.frame(pred),data.frame(clust))
plot(result$yn,col = c("red","blue")[result$clust],pch = 16,ylab = "yn")

[Figure: clusters in NPreg]

And the confusion matrix:

table(result$class,result$clust)

[Figure: confusion matrix for NPreg]

To get the predicted values of yn, I select, for each data point, the prediction of the component (cluster) to which it belongs.

for(i in 1:nrow(result)){
  result$pred_model1[i] = result[,paste0("Comp.",result$clust[i],".1")][i]
  result$pred_model2[i] = result[,paste0("Comp.",result$clust[i],".2")][i]
}
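
The row-by-row loop above can also be written without a loop. The following is only a sketch on a made-up data frame (`result_toy` and its values are hypothetical stand-ins for `result`; the column names `Comp.1.1` and `Comp.2.1` are assumed to match what `data.frame(pred)` produces). It uses base R matrix indexing, where a two-column index matrix of (row, column) pairs picks one entry per row.

```r
# Toy stand-in for `result`: per-component predictions plus a cluster label.
# All values are invented for illustration.
result_toy <- data.frame(
  Comp.1.1 = c(10, 20, 30, 40),
  Comp.2.1 = c(11, 21, 31, 41),
  clust    = c(1L, 2L, 2L, 1L)
)

# Matrix of component predictions, one column per component.
comp_pred <- as.matrix(result_toy[, c("Comp.1.1", "Comp.2.1")])

# (row, column) index pairs: for row i, take column clust[i].
idx <- cbind(seq_len(nrow(result_toy)), result_toy$clust)
result_toy$pred_model1 <- comp_pred[idx]

result_toy$pred_model1
```

This relies on the component columns being ordered so that column k corresponds to cluster k.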

The actual vs. predicted plot shows the fit. I show only one of them here because both of your models are the same; you would use pred_model2 for the second model.

library(ggplot2)
qplot(result$yn, result$pred_model1, xlab = "Actual", ylab = "Predicted") + geom_abline()

[Figure: actual vs. predicted]

RMSE = sqrt(mean((result$yn-result$pred_model1)^2))

This gives a root mean square error of 5.54.
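
If you compute this for several models, the RMSE expression can be wrapped in a small helper. This is a minimal sketch; `rmse` is a hypothetical name and the numbers are toy values. Keep in mind that the cluster labels come from `clusters(mod, NPreg)`, which uses the observed yn, so this is an in-sample error and not a measure of performance on new data where yn is unknown.

```r
# Root mean square error between two numeric vectors of equal length.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy check with invented numbers: squared errors are 0, 0, 4.
rmse(c(1, 2, 3), c(1, 2, 5))
```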

This answer is based on many SO answers I read through while working with flexmix. It worked well for my problem.

You may also be interested in visualizing the two distributions. My model was the following; it shows some overlap, since the ratios of the components are not close to 1.

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2, 
model = list(FLXMRglm(yn ~ x, family = "gaussian"), 
             FLXMRglm(yn ~ x, family = "gaussian")))

       prior size post>0 ratio
Comp.1 0.481  102    129 0.791
Comp.2 0.519   98    171 0.573

'log Lik.' -1312.127 (df=13)
AIC: 2650.255   BIC: 2693.133 

I also plot histograms with density curves to visualize both components. This was inspired by an SO answer from the maintainer of betareg.

a = subset(result, clust == 1)
b = subset(result, clust == 2)
hist(a$yn, col = hcl(0, 50, 80), main = "",xlab = "", freq = FALSE, ylim = c(0,0.06))
hist(b$yn, col = hcl(240, 50, 80), add = TRUE,main = "", xlab = "", freq = FALSE, ylim = c(0,0.06))
ys = seq(0, 50, by = 0.1)
lines(ys, dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)), col = hcl(0, 80, 50), lwd = 2)
lines(ys, dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)), col = hcl(240, 80, 50), lwd = 2)

[Figure: component densities]

# Joint Histogram
p <- prior(mod)
hist(result$yn, freq = FALSE,main = "", xlab = "",ylim = c(0,0.06))
lines(ys, p[1] * dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)) +
        p[2] * dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)))
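
The joint curve drawn above is a weighted sum of two normal densities. As a sanity check on that construction, here is a small base R sketch (the function name `mix_density` and all parameter values are made up for illustration) that evaluates such a mixture density and confirms numerically that it integrates to roughly 1 when the weights sum to 1.

```r
# Density of a Gaussian mixture: sum over k of p[k] * N(y | mu[k], sigma[k]).
mix_density <- function(y, p, mu, sigma) {
  stopifnot(length(p) == length(mu), length(mu) == length(sigma),
            abs(sum(p) - 1) < 1e-8)
  dens <- vapply(seq_along(p),
                 function(k) p[k] * dnorm(y, mean = mu[k], sd = sigma[k]),
                 numeric(length(y)))
  # vapply gives a length(y) x K matrix, or a plain vector if length(y) == 1
  if (is.matrix(dens)) rowSums(dens) else sum(dens)
}

# Invented weights, means and standard deviations
grid <- seq(-20, 70, by = 0.1)
d <- mix_density(grid, p = c(0.48, 0.52), mu = c(15, 30), sigma = c(5, 8))

# Riemann sum over the grid; should be close to 1
sum(d) * 0.1
```

If the weights passed in do not sum to 1 (for example, raw cluster counts), the curve will not be a proper density and will not line up with a histogram drawn with `freq = FALSE`.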


You can pass an additional argument to your predict call:

pred <- predict(mod, NPreg, aggregate = TRUE)[[1]][,1]
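
To see what the aggregated prediction corresponds to, the sketch below compares it with a manual combination of the component predictions weighted by `prior()`. It assumes, as far as I can tell from the package behaviour, that `aggregate = TRUE` weights the components by the estimated mixing proportions and not by the hard cluster sizes used in the question. For simplicity it fits a single-model mixture (`fit` is a name introduced here, not the `mod` from the question).

```r
library(flexmix)

set.seed(1)  # EM starts from a random assignment, so fix the seed
data("NPreg", package = "flexmix")
fit <- flexmix(yn ~ x, data = NPreg, k = 2)  # single model, two components

comp <- predict(fit, NPreg)  # list with one matrix per component
w    <- prior(fit)           # estimated mixing proportions

# Manual prior-weighted combination of the two component predictions
manual <- w[1] * comp$Comp.1[, 1] + w[2] * comp$Comp.2[, 1]

# Built-in aggregation; [[1]] selects the first (here: only) model
agg <- predict(fit, NPreg, aggregate = TRUE)[[1]][, 1]

# Under the assumption above, the two should agree
all.equal(unname(manual), unname(agg))
```

Note that this yields the mixture mean for each x, which can fall between the two regression lines; the cluster-based selection in the previous answer instead picks one component per observation.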

