Running Random Forest in Parallel
I am using the R programming language. I am trying to fit a random forest model on a very large dataset (over 100 million rows) with imbalanced classes (i.e. a binary response variable split roughly 95% to 5%). Below, I have included the R code for the steps I wrote:
Step 1: Create data for the problem
# Step 1: Randomly create data and make initial training/test set:
library(dplyr)
library(ranger)
original_data = rbind(
  data_1 = data.frame(class = 1,
                      height = rnorm(10000, 180, 10),
                      weight = rnorm(10000, 90, 10),
                      salary = rnorm(10000, 50000, 10000)),
  data_2 = data.frame(class = 0,
                      height = rnorm(100, 160, 10),
                      weight = rnorm(100, 100, 10),
                      salary = rnorm(100, 40000, 10000))
)
original_data$class = as.factor(original_data$class)
original_data$id = 1:nrow(original_data)
test_set = rbind(
  original_data[sample(which(original_data$class == "0"), 30, replace = FALSE), ],
  original_data[sample(which(original_data$class == "1"), 2000, replace = FALSE), ]
)
train_set = anti_join(original_data, test_set)
Step 2: Create "Balanced" Random Subsets
# Step 2: Create "Balanced" Random Subsets:
results <- list()
for (i in 1:100)
{
iteration_i = i
sample_i = rbind(
  train_set[sample(which(train_set$class == "0"), 50, replace = TRUE), ],
  train_set[sample(which(train_set$class == "1"), 60, replace = TRUE), ]
)
results_tmp = data.frame(iteration_i, sample_i)
results_tmp$iteration_i = as.factor(results_tmp$iteration_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
X <- split(results_df, results_df$iteration_i)  # the column is named iteration_i
# (Optional) also expose each subset as train_set_1 ... train_set_100 in the global environment:
invisible(lapply(seq_along(results),
                 function(i, x) {assign(paste0("train_set_", i), x[[i]], envir = .GlobalEnv)},
                 x = results))
Step 3: Train Models on Each Subset
# Step 3: Train Models on Each Subset:
#training
wd = getwd()
results_1 <- list()
for (i in 1:100){
model_i <- ranger(class ~ height + weight + salary, data = X[[i]], probability = TRUE)
saveRDS(model_i, file.path(wd, paste0("model_", i, ".RDS")))  # fixed: use the wd variable, not the literal string "wd"
results_1[[i]] <- model_i
}
Step 4: Combine All Models and Use the Combined Model to Make Predictions on the Test Set
# Step 4: Combine All Models and Use Combined Model to Make Predictions on the Test Set:
results_2 <- list()
for (i in 1:100){
predict_i <- data.frame(predict(results_1[[i]], data = test_set)$predictions)
predict_i$id = 1:nrow(predict_i)
results_2[[i]] <- predict_i
}
final_predictions = aggregate(.~ id, do.call(rbind, results_2), mean)
My question: I would like to see if I can incorporate parallel computing into Step 2, Step 3, and Step 4 so that this code runs faster. I consulted other posts (e.g. https://stackoverflow.com/questions/14106010/parallel-execution-of-random-forest-in-r and https://stats.stackexchange.com/questions/519640/parallelizing-random-forest-learning-in-r-changes-the-class-of-the-rf-object), and I would like to see if I can reformat the code I wrote to incorporate similar parallel-computing functionality:
library(parallel)
library(doParallel)
library(foreach)
#Try to parallelize
cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)
# Insert Reformatted Step 2 - Step 4 Here:
stopImplicitCluster()
stopCluster(cl)
rm(cl)
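For concreteness, here is a minimal sketch (my own assumption of how the pieces fit together, not a tested solution at the 100-million-row scale) of what "Reformatted Step 2 - Step 4" could look like inside that skeleton. Each `foreach` iteration draws one balanced subset, trains one `ranger` model, and predicts on the test set; the averaging from Step 4 happens once at the end. `num.threads = 1` is set inside the loop so `ranger`'s own internal threading does not oversubscribe the cores the cluster workers already occupy:

```r
library(parallel)
library(doParallel)
library(foreach)
library(ranger)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Steps 2-4 fused into one parallel loop; assumes train_set and test_set
# from Step 1 exist in the calling environment (foreach exports them to
# the workers automatically because they are referenced in the body)
results_2 <- foreach(i = 1:100, .packages = "ranger") %dopar% {
  # Step 2: one "balanced" bootstrap subset per iteration
  sample_i <- rbind(
    train_set[sample(which(train_set$class == "0"), 50, replace = TRUE), ],
    train_set[sample(which(train_set$class == "1"), 60, replace = TRUE), ]
  )
  # Step 3: train one model on this subset
  model_i <- ranger(class ~ height + weight + salary, data = sample_i,
                    probability = TRUE, num.threads = 1)
  # Step 4 (per model): predicted class probabilities on the test set
  predict_i <- data.frame(predict(model_i, data = test_set)$predictions)
  predict_i$id <- 1:nrow(predict_i)
  predict_i
}

stopCluster(cl)
rm(cl)

# Step 4 (combine): average the predictions across the 100 models
final_predictions <- aggregate(. ~ id, do.call(rbind, results_2), mean)
```

One caveat with this shape: parallel workers do not share R's random-number stream, so for reproducible subsets you would want `doRNG::%dorng%` in place of `%dopar%`.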
However, I am still very new to the world of parallel computing and am still trying to figure out how to reformat my code so that it will work.
Can someone please show me how to do this?
Notes:
The questions I consulted previously (e.g. "Parallel execution of random forest in R", https://stats.stackexchange.com/questions/519640/parallelizing-random-forest-learning-in-r-changes-the-class-of-the-rf-object) use the "randomForest" package instead of "ranger". I am also open to using the "randomForest" package if that would make parallelization easier.
I acknowledge that the overall structure of my code may not be written in an optimal way; I am open to suggestions for rewriting it if that would make parallelization easier.
I realize that there are several popular R packages for parallelizing code (e.g. https://cran.r-project.org/web/packages/doSNOW/index.html); I am also open to using any of these packages.
Given your openness to suggestions, you could try a tidymodels approach with your original_data, including parallel processing:
library(tidyverse)
library(tidymodels)
library(vip)
library(doParallel)
library(tictoc)
library(themis)
registerDoParallel(cores = 6)
# Supplied data
set.seed(2022)
original_data <- rbind(
data_1 = data.frame(
class = 1,
height = rnorm(10000, 180, 10),
weight = rnorm(10000, 90, 10),
salary = rnorm(10000, 50000, 10000)
),
data_2 = data.frame(
class = 0,
height = rnorm(100, 160, 10),
weight = rnorm(100, 100, 10),
salary = rnorm(100, 40000, 10000)
)
)
original_data$class <- as.factor(original_data$class)
original_data$id <- 1:nrow(original_data)
tic()
# Train / test data
set.seed(2022)
data_split <-
original_data |>
initial_split(strata = class) # stratify by class
train_df <- data_split |> training()
test_df <- data_split |> testing()
# Create a pre-processing recipe
class_recipe <-
train_df |>
recipe() |>
update_role(class, new_role = "outcome") |>
update_role(id, new_role = "id") |>
update_role(-has_role("outcome"), -has_role("id"), new_role = "predictor") |>
step_rose(class)
# Check class balance
class_recipe |> prep() |> bake(new_data = NULL) |> count(class)
#> # A tibble: 2 × 2
#> class n
#> <fct> <int>
#> 1 0 7407
#> 2 1 7589
summary(class_recipe)
#> # A tibble: 5 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 class nominal outcome original
#> 2 height numeric predictor original
#> 3 weight numeric predictor original
#> 4 salary numeric predictor original
#> 5 id numeric id original
# Create model & workflow
ranger_model <-
rand_forest(mtry = tune()) |>
set_engine("ranger", importance = "impurity") |>
set_mode("classification")
ranger_wflow <- workflow() |>
add_recipe(class_recipe) |>
add_model(ranger_model)
# Tune model with 10-fold Cross Validation
set.seed(2022)
folds <- vfold_cv(train_df, v = 10)
set.seed(2022)
ranger_res <- ranger_wflow |>
tune_grid(
resamples = folds,
grid = crossing(mtry = seq(1, 3, 1)),
control = control_grid(verbose = TRUE),
metrics = metric_set(accuracy) # choose a metric, e.g. accuracy
)
# Fit model
best_tune <- ranger_res |> select_best()
set.seed(2022)
ranger_fit <- ranger_wflow |>
finalize_workflow(best_tune) |>
fit(train_df)
# Test
class_results <- ranger_fit |> augment(new_data = test_df)
class_results |> accuracy(class, .pred_class)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.912
# Visualise feature importance
ranger_fit |>
extract_fit_parsnip() |>
vip() +
labs(title = "Feature Importance -- Ranger")
toc()
#> 62.393 sec elapsed
Created on 2022-06-21 by the reprex package (v2.0.1)
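One more aside that may make the explicit cluster unnecessary for the single-forest case: `ranger()` already grows its trees in parallel internally via its `num.threads` argument. A minimal sketch, assuming the `train_set` from your Step 1:

```r
library(ranger)
library(parallel)

# No foreach/cluster setup needed: ranger distributes tree growing
# across threads on its own
model <- ranger(class ~ height + weight + salary, data = train_set,
                probability = TRUE, num.threads = detectCores() - 1)
```

So explicit parallelization (foreach, doParallel, etc.) mainly buys you something when, as in your Steps 2-4, you are training many separate forests on different balanced subsets rather than one big forest.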