Looping over the values of a variable with randomForest in R

Question

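I have been trying to run a randomForest model for different values. I am used to the "foreach" command in STATA, but R seems to work differently.

I have searched for a long time without success, even though this is (I think) something simple. Here is what I want to do:

I am running the following randomForest model: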

modelRandom = randomForest(y~a+b+c+d+e, data=dataframe, mtry=4, ntree=30)

After that, I want to predict the probabilities for each observation like this:

Prob<-predict(modelRandom, dataframe, type = 'prob')

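Now the problem: I want to loop over the values of one variable (b) in the randomForest model and predict the probabilities for each value.

The variable b takes twelve different values (1:12). I would like R to set b to 1 for every observation and predict the probabilities, then set b to 2 for every observation and predict again, then 3, 4, 5, and so on.

All of these probabilities should then go into one table, alongside the corresponding value of variable c, like this: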

C prob1 prob2 prob3 prob4 prob5 prob6 prob7 prob8 prob9 prob10 prob11 prob12

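I want c included because otherwise I cannot tell which observation the probabilities belong to.

I came up with this, but I think I am still far from what I want: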

for(b in dataframe){
prob[b]<-predict(modelRandom, dataframe, type = 'prob')
}

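As requested, here is some more information about the data set. I have masked parts of it because it contains customer information that I cannot share.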

structure(list(X = c("NVT", "NVT", "NVT", "NVT", "NVT", 
"NVT"), a = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("0", 
"1"), class = "factor"), d= structure(c(2L, 2L, 1L, 1L, 1L, 2L), .Label = c("Dhr.", 
"Mevr."), class = "factor"), c = c("3331GE", "2285EH", 
"9401GE", "5591DZ", "2611CE", "1359KB"), b = structure(c(12L, 
12L, 12L, 12L, 12L, 12L), .Label = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12"), class = "factor"), e = structure(c(5L, 
6L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5", "6", 
"7", "8"), class = "factor"), .Names = c("X", "a", "d", "c", "b", "e"), row.names = c(NA, 
6L), class = "data.frame")

Thanks!

Answer 1

Here is an example with a larger data set, because the one you provided cannot be used to build a model.

First, simulate some data:

r_data <- data.frame(y = as.factor(sample(0:1, 100, replace = TRUE)), 
                     matrix(rnorm(1000), 100),
                     b = sample(1:12, 100, replace = TRUE))

Extract the row names:

names_rows <- rownames(r_data)

Here y is a binary outcome, X1-X10 are ten numeric features, and b takes values from 1 to 12.

Build the model:

library(randomForest)
modelRandom <- randomForest(y~., data = r_data, mtry = 4, ntree = 30)

Build the new prediction data by repeating the numeric features 12 times and pairing each copy with one value of b (1:12):

n_row <- nrow(r_data)

newdata <- data.frame(r_data[rep(1:n_row, 12), 2:11], b = rep(1:12, each =  n_row))
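
To see why the two rep() calls line up, here is a minimal sketch with a stand-in of 3 rows instead of n_row (the names n, row_index and b_values are only illustrative). rep(1:n, 12) cycles through all rows twelve times, while rep(1:12, each = n) holds each value of b for one full cycle, so every row ends up paired with every value of b exactly once.

```r
n <- 3                             # stand-in for n_row
row_index <- rep(1:n, 12)          # 1 2 3 1 2 3 ...: the whole data set, 12 times over
b_values  <- rep(1:12, each = n)   # 1 1 1 2 2 2 ...: one value of b per copy
head(data.frame(row_index, b_values), 6)
```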

Get predictions for the new data and bind the b column and the row names from above:

preds <- data.frame(predict(modelRandom, newdata, type = 'prob'),
                    b = rep(1:12, each = n_row),
                    names_rows = as.numeric(rep(names_rows, times = 12)))
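
A note on the column names: predict(..., type = 'prob') returns a matrix with one column per class of y, here "0" and "1". Wrapping it in data.frame() turns those labels into syntactic names, which is where the X1 column used further down comes from. The small sketch below uses a hand-made matrix p as a hypothetical stand-in for the prediction matrix.

```r
# Hypothetical stand-in for the matrix returned by predict(..., type = 'prob'):
# one column per class level of y, here "0" and "1".
p <- matrix(c(0.7, 0.3,
              0.2, 0.8), ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("0", "1")))

names(data.frame(p))                       # names made syntactic by make.names()
names(data.frame(p, check.names = FALSE))  # original class labels kept as-is
```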

Tidy it into the desired output:

library(tidyverse)

preds %>%
  select(X1, b, names_rows) %>% #select only prob for outcome 1 and the b column
  group_by(b)  %>%
  mutate(z = 1 :  n_row) %>% #generate unique row identifier 
  spread(b, X1) %>% #convert to wide format
  select(-z) #remove unique row identifier 
    #output:

# A tibble: 100 x 13
   names_rows        `1`        `2`        `3`        `4`       `5`        `6`
 *      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>
 1          1 0.30000000 0.30000000 0.30000000 0.30000000 0.3000000 0.30000000
 2          2 0.70000000 0.70000000 0.73333333 0.73333333 0.7000000 0.70000000
 3          3 0.23333333 0.23333333 0.23333333 0.23333333 0.2000000 0.20000000
 4          4 0.33333333 0.30000000 0.26666667 0.26666667 0.3000000 0.26666667
 5          5 0.30000000 0.30000000 0.33333333 0.30000000 0.3000000 0.26666667
 6          6 0.23333333 0.20000000 0.16666667 0.16666667 0.2000000 0.16666667
 7          7 0.06666667 0.06666667 0.06666667 0.06666667 0.1000000 0.06666667
 8          8 0.26666667 0.23333333 0.20000000 0.20000000 0.1666667 0.16666667
 9          9 0.20000000 0.20000000 0.16666667 0.10000000 0.1000000 0.10000000
10         10 0.83333333 0.83333333 0.90000000 0.83333333 0.8333333 0.86666667
# ... with 90 more rows, and 6 more variables: `7` <dbl>, `8` <dbl>, `9` <dbl>,
#   `10` <dbl>, `11` <dbl>, `12` <dbl>
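
As a possible alternative to the spread() pipeline, newer versions of tidyr provide pivot_wider(), which can also prefix the new column names so they read prob1, prob2, ... as in the question. This is only a sketch on a tiny made-up table (long is a hypothetical stand-in for preds); it assumes names_rows identifies each observation uniquely within each value of b, in which case the helper column z is not needed.

```r
library(tidyr)

# Tiny hypothetical long-format table standing in for preds:
# 3 observations, 2 values of b.
long <- data.frame(names_rows = rep(1:3, times = 2),
                   b = rep(1:2, each = 3),
                   X1 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

# One row per observation, one "prob<b>" column per value of b.
wide <- pivot_wider(long,
                    id_cols = names_rows,
                    names_from = b,
                    values_from = X1,
                    names_prefix = "prob")
wide
```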

Save it in an object:

preds %>%
  select(X1, b, names_rows) %>%
  group_by(b)  %>%
  mutate(z = 1 :  n_row) %>%
  spread(b, X1) %>% 
  select(-z) -> saved_object
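
For readers coming from STATA's foreach, the same table can also be built with an explicit loop over the values of b. The sketch below is an alternative to the approach above, not a restatement of it, and rests on a few assumptions: b is stored as a factor with levels 1 to 12 (as in the question's data), the data are simulated, and the names sim, fit, prob_by_b and result are purely illustrative. The id column stands in for the question's variable c.

```r
library(randomForest)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
# Simulated data in the same shape as above, but with b as a factor.
sim <- data.frame(y = as.factor(sample(0:1, 100, replace = TRUE)),
                  matrix(rnorm(1000), 100),
                  b = factor(sample(1:12, 100, replace = TRUE), levels = 1:12))
fit <- randomForest(y ~ ., data = sim, mtry = 4, ntree = 30)

# For each level of b: copy the data, set b to that level for every row,
# and keep the predicted probability of class "1".
prob_by_b <- sapply(levels(sim$b), function(lev) {
  tmp <- sim
  tmp$b <- factor(lev, levels = levels(sim$b))
  predict(fit, tmp, type = "prob")[, "1"]
})
colnames(prob_by_b) <- paste0("prob", levels(sim$b))

# id stands in for the question's identifier column c.
result <- data.frame(id = rownames(sim), prob_by_b)
head(result)
```

Keeping the factor levels identical to those seen in training matters here, since predict() expects factor predictors to have the same levels as in the fitted model.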

Solution 1
0 2017-11-22 11:44:44