
Repeated Simulation of New Data Prediction with Tidymodels (Parsnip XGBoost)

I have a fitted model, created with predictive_fit <- fit(workflow, training), that classifies species in the Iris dataset using xgboost. The data are pivoted wide so that each species is a dummy column coded 0 or 1. Here, I am trying to predict Virginica from the Sepal and Petal columns.

Currently, I have the following code, which takes the test set after the model has been fit and checks whether the model can accurately predict the Virginica species of iris. (Snippet below)

testing_data <-
    test %>%
    bind_cols(
        predict(predictive_fit, test)
    )

I cannot, however, figure out how to scale this up with simulation. If I have another dataset with exactly the same structure, I would like to predict whether it is Virginica 100 times. (Snippet below)

new_iris_data <-
    new_iris_data %>%
    bind_cols(
        replicate(n = 100, predict(predictive_fit, new_iris_data))
    )

However, when I run the new data, it looks as if the same predictions are simply copied 100 times. What is the appropriate way to repeatedly predict the classification? I wouldn't expect the model to predict exactly the same thing all 100 times, and I'd like some way to run the predictions n times so that a proportion can be calculated for every row of the new data.

I have already tried the replicate() function, but it appears to copy the exact same results 100 times. I considered a for loop that sets a different seed on each iteration and then runs the predictions, but I was hoping for a more performant solution.

You are replicating the prediction of your model, not the data.frame you call new_iris_data, and the result is exactly that. To replicate a (random) part of the iris dataset, try this:

> data("iris")
> 
> idx <- sample(nrow(iris), floor(nrow(iris) * 0.5))
> 
> train <- iris[idx,]
> test <- iris[-idx,]
> 
> new_test <- replicate(100, test, simplify = FALSE)
> new_test <- Reduce(rbind.data.frame, new_test)
> 
> head(new_test)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
8          5.0         3.4          1.5         0.2  setosa
9          4.4         2.9          1.4         0.2  setosa
> nrow(new_test)
[1] 7500

Then you can use new_test in any prediction, independent of the model.

If you want 100 different random parts of the data set, you need to drop the replicate function and do something like:
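As a rough illustration of feeding the stacked data to a model, the sketch below uses a base-R logistic regression as a stand-in for the xgboost workflow. The `glm` model, the `virginica` column, the chosen predictors, and the seed are all assumptions made for the example, not the original setup. It also checks that a model fitted once returns the same value for every copy of a row.

```r
data("iris")
set.seed(1)  # assumption: fixed only so the example is reproducible

# 0/1 target for the species of interest
iris$virginica <- as.integer(iris$Species == "virginica")

idx   <- sample(nrow(iris), floor(nrow(iris) * 0.5))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Stack 100 copies of the test set, as in the answer above
new_test <- do.call(rbind, replicate(100, test, simplify = FALSE))

# Stand-in model (NOT the xgboost workflow from the question)
fit <- suppressWarnings(
  glm(virginica ~ Petal.Length + Petal.Width, data = train, family = binomial)
)

# Predicted probability of virginica for every stacked row
new_test$prob <- suppressWarnings(
  predict(fit, newdata = new_test, type = "response")
)

# Each column of this matrix is one copy of the test set;
# compare every copy against the first one
prob_matrix      <- matrix(new_test$prob, nrow = nrow(test))
copies_identical <- all(prob_matrix == prob_matrix[, 1])
```

With a model that has already been fitted, `copies_identical` should be `TRUE`: stacking the data changes how many rows are scored, not what the model predicts for each row.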

> new_test <- lapply(1:100, function(x) {
+   idx <- sample(nrow(iris), floor(nrow(iris) * 0.5))
+   iris[-idx,]
+ })
> 
> new_test <- Reduce(rbind.data.frame, new_test)
> 
> head(new_test)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
7           4.6         3.4          1.4         0.3  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
18          5.1         3.5          1.4         0.3  setosa
> nrow(new_test)
[1] 7500
> 
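If the goal is a per-row proportion, as the question describes, one possible variation is to refit the model on resampled training data in each iteration and average the class predictions. This is a different technique from the resampling of test rows shown above, and it is only a sketch: `glm` again stands in for the xgboost workflow, and `n_sims`, `votes`, `prop_virginica`, and the 0.5 cutoff are hypothetical names and choices introduced for the example.

```r
data("iris")
set.seed(42)  # assumption: fixed only so the example is reproducible

iris$virginica <- as.integer(iris$Species == "virginica")

idx      <- sample(nrow(iris), floor(nrow(iris) * 0.5))
train    <- iris[idx, ]
new_data <- iris[-idx, ]

n_sims <- 100

# One column of 0/1 class predictions per bootstrap refit
votes <- replicate(n_sims, {
  # Resample the training rows with replacement
  boot <- train[sample(nrow(train), replace = TRUE), ]

  # Stand-in model (NOT the xgboost workflow from the question)
  fit <- suppressWarnings(
    glm(virginica ~ Petal.Length + Petal.Width, data = boot, family = binomial)
  )

  prob <- suppressWarnings(predict(fit, newdata = new_data, type = "response"))
  as.integer(prob > 0.5)
})

# Share of refits that classified each row as virginica
new_data$prop_virginica <- rowMeans(votes)
```

Here the variation between iterations comes from the refits, not from calling `predict()` repeatedly on one fitted model.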

Hope it helps.


 