How to run a simulation over class probabilities using caret gbm model results
I am not sure if this is the right approach, and I'd be happy to be corrected.
I fitted a gbm model using the caret package in R. For the sake of an example, I do it without any parameter tuning, using the iris dataset:
library(caret)
data(iris)
gbmFit <- train(Species ~ ., data = iris, method = "gbm")
This lets me classify which species a flower belongs to, given the 4 measurements in the iris dataset.
I am interested in using the results of the predict function with type = "prob" to run simulations.
Since this is just an example, I don't have new data, so I will use the same data as if it were new. I used the predict function to get the probability that each flower belongs to each species:
PROBS <- predict(gbmFit, iris[,1:4], type="prob")
These are the first rows of the result:
head(PROBS)
     setosa   versicolor    virginica
1 0.9999989 1.087268e-06 1.679813e-10
2 0.9999998 1.689137e-07 1.404242e-09
3 0.9999995 5.381312e-07 3.131823e-10
4 0.9999996 4.335414e-07 3.912857e-10
5 0.9999989 1.087268e-06 1.679813e-10
6 0.9999987 1.278968e-06 1.679813e-10
I know how to do a simulation for one flower: I use the PROBS data frame to get the probabilities of that flower belonging to each species, and then use sample to simulate the classification given those probabilities, making, let's say, 100,000 classifications. I use row 107 because it is a less certain case:
set.seed(123)
summary(as.factor(sample(c("setosa", "versicolor", "virginica"), size = 100000, replace = TRUE, prob = PROBS[107,])))
which results in:
versicolor  virginica
     14731      85269
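For a single flower, the repeated draws above amount to one multinomial draw, so the counts can also be obtained directly with base R's rmultinom. This is only a minimal sketch: the vector p below holds made-up illustrative probabilities, not the fitted values for row 107, and with the real model you would pass unlist(PROBS[107, ]) instead.

```r
# Illustrative probabilities only (an assumption, not the fitted values for row 107)
p <- c(setosa = 0, versicolor = 0.15, virginica = 0.85)

set.seed(123)
# One multinomial draw of 100000 classifications gives the per-species counts directly
counts <- rmultinom(n = 1, size = 100000, prob = p)[, 1]
names(counts) <- names(p)

counts                # counts per species, including species that were never drawn
counts / sum(counts)  # empirical proportions, which should be close to p
```

Unlike summary(as.factor(...)), this keeps a zero count for species that were never drawn, which may or may not be what you want.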
My goal is to run a simulation like this with new data and get the following result: on average, how many flowers were classified into each species per simulation (av_class_species), and what the minimum and maximum were for each species (min_class_species, max_class_species). As an example, I made this fake data frame (there are 150 flowers in the dataset):
av_class_setosa max_class_setosa min_class_setosa av_class_versicolor...
24.4 35 12 30.2
Any help would be greatly appreciated.
Found my own answer, although I would be happy if someone found a more efficient way. This runs 100 simulations:
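classes <- c("setosa", "versicolor", "virginica")
SIMUL <- list()
for (i in 1:100) {
  # Draw one simulated class per flower from its predicted probabilities
  species <- character(nrow(PROBS))
  for (j in 1:nrow(PROBS)) {
    species[j] <- sample(classes, size = 1, prob = unlist(PROBS[j, ]))
  }
  # Count flowers per species; fixed levels keep species that got zero flowers
  SIMUL[[i]] <- as.data.frame(table(factor(species, levels = classes)))
}
# Stack the per-simulation counts and summarise them by species
SIMUL <- do.call("rbind", SIMUL)
SIMUL <- dplyr::group_by(SIMUL, Var1)
SIMUL <- dplyr::summarise(SIMUL, MEAN_class = mean(Freq), MIN_Class = min(Freq), MAX_Class = max(Freq))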
This results in:
SIMUL
Source: local data frame [3 x 4]

        Var1 MEAN_class MIN_Class MAX_Class
      (fctr)      (dbl)     (int)     (int)
1     setosa       50.0        50        50
2 versicolor       49.7        47        53
3  virginica       50.3        47        53
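On the question of a more efficient way, one possible base-R variant replaces the nested loops with apply and replicate and collects the counts in a matrix. Treat it as a sketch under assumptions: the probs matrix is a small made-up stand-in for PROBS (with the real model you would use as.matrix(PROBS)), and the names simulate_once, n_sims and result are only illustrative.

```r
# Hypothetical stand-in for PROBS: one row per flower, one column per species
probs <- rbind(
  c(setosa = 1.0, versicolor = 0.0, virginica = 0.0),
  c(setosa = 0.0, versicolor = 0.9, virginica = 0.1),
  c(setosa = 0.0, versicolor = 0.2, virginica = 0.8),
  c(setosa = 0.0, versicolor = 0.5, virginica = 0.5)
)
species <- colnames(probs)
n_sims <- 100

# One simulation: draw a class index for every flower, then count per species.
# tabulate() with a fixed nbins keeps species that receive zero flowers.
simulate_once <- function(probs) {
  idx <- apply(probs, 1, function(p) sample.int(length(p), size = 1, prob = p))
  tabulate(idx, nbins = ncol(probs))
}

set.seed(123)
# Matrix with one row per species and one column per simulation
counts <- replicate(n_sims, simulate_once(probs))
rownames(counts) <- species

result <- data.frame(
  species    = species,
  mean_class = rowMeans(counts),
  min_class  = apply(counts, 1, min),
  max_class  = apply(counts, 1, max)
)
result
```

Each column of counts sums to the number of flowers, so the summary has the same meaning as the MEAN_class, MIN_Class and MAX_Class columns above. I have not benchmarked it against the loop version.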