Creating many ROC curves in R?

Question

I have 150 columns of scores to compare against a column of labels (1/0). My goal is to compute 150 AUC scores.

Here is a manual example:

auc(roc(df$label, df$col1)),
auc(roc(df$label, df$col2)),

...

I could use Map / sapply / lapply here, but is there any other method or function?

Answer 1

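This is an XY problem. What you actually want is to speed up your computation. gfgm's answer addresses that with parallelization, but that is only one way to go.

If I assume you are using the roc / auc functions from library(pROC), you can gain more speed by choosing the appropriate algorithm for your data set.

pROC essentially has two algorithms, and they scale very differently depending on the characteristics of the data set. You can benchmark which one is fastest by passing algorithm=0 to roc: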

# generate some toy data
label <- rbinom(600000, 1, 0.5)
score <- rpois(600000, 10)

library(pROC)
roc(label, score, algorithm=0)
Starting benchmark of algorithms 2 and 3, 10 iterations...
  expr        min         lq       mean     median        uq      max neval
2    2 4805.58762 5827.75410 5910.40251 6036.52975 6085.8416 6620.733    10
3    3   98.46237   99.05378   99.52434   99.12077  100.0773  101.363    10
Selecting algorithm 3.

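Here algorithm 3 is selected, which shines when the number of thresholds stays low. But if 600000 data points take 5 minutes to compute, I strongly suspect that your data is very continuous (no measurements with identical values) and that you have about as many thresholds as data points (600000). In that case you can skip directly to algorithm 2, which scales much better as the number of thresholds in the ROC curve increases.

You can then run: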
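As a quick way to check which situation applies, one can count the distinct score values before calling roc at all. This is only a base-R sketch on made-up data (the variable names are illustrative, not from the question), and the number of distinct values is just a rough proxy for the number of ROC thresholds:

```r
set.seed(1)
n <- 10000
discrete_score   <- rpois(n, 10)  # few distinct values, like the toy data above
continuous_score <- rnorm(n)      # essentially no ties

# Number of distinct score values: a rough proxy for the number of ROC
# thresholds, which is what separates the two cases described above.
n_distinct <- c(discrete   = length(unique(discrete_score)),
                continuous = length(unique(continuous_score)))
n_distinct
```

If the count is close to the number of observations, the data is in the "very continuous" regime described above.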

auc(roc(df$label, df$col1, algorithm=2)),
auc(roc(df$label, df$col2, algorithm=2)),

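On my machine, each call to roc then takes about 5 seconds, largely independent of the number of thresholds. That way you would be done in under 15 minutes in total. Unless you have 50 cores or more, this will be faster than parallelizing alone. But of course you can do both...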
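To avoid writing out 150 calls by hand, the same call can be applied to every score column with sapply. The following is a minimal sketch on a small toy data frame; the column names and sizes are assumptions standing in for the asker's df, and quiet = TRUE requires a reasonably recent pROC version:

```r
library(pROC)

set.seed(42)
# Toy stand-in for the asker's data frame (names and sizes are assumptions).
n <- 2000
df <- data.frame(label = rbinom(n, 1, 0.5),
                 col1 = rnorm(n), col2 = rnorm(n), col3 = rnorm(n))

score_cols <- setdiff(names(df), "label")

# One AUC per score column, forcing algorithm 2 for continuous scores.
# quiet = TRUE suppresses pROC's messages about levels and direction.
aucs <- sapply(df[score_cols], function(score) {
  as.numeric(auc(roc(df$label, score, algorithm = 2, quiet = TRUE)))
})
aucs
```

The result is a named numeric vector with one AUC per column, which is usually easier to work with than a list of auc objects.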

Answer 2

If you want to parallelize the computations, you could do it like this:

# generate some toy data
label <- rbinom(1000, 1, .5)
scores <- matrix(runif(1000*150), ncol = 150)
df <- data.frame(label, scores)

library(pROC)
library(parallel)

auc(roc(df$label, df$X1))
#> Area under the curve: 0.5103

auc_res <- mclapply(df[,2:ncol(df)], function(score){auc(roc(df$label, score))})
head(auc_res)
#> $X1
#> Area under the curve: 0.5103
#> 
#> $X2
#> Area under the curve: 0.5235
#> 
#> $X3
#> Area under the curve: 0.5181
#> 
#> $X4
#> Area under the curve: 0.5119
#> 
#> $X5
#> Area under the curve: 0.5083
#> 
#> $X6
#> Area under the curve: 0.5159

Since most of the computation time seems to be spent in the call to auc(roc(...)), this should speed things up if you have a multicore machine.
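Two details may be worth spelling out. mclapply uses getOption("mc.cores", 2L) workers unless mc.cores is given, and it relies on forking, so on Windows it can only run with one core. The sketch below sets mc.cores explicitly and collapses the result into a plain numeric vector; the worker count of 2 and the smaller toy data are assumptions for illustration only:

```r
library(pROC)
library(parallel)

set.seed(1)
label  <- rbinom(1000, 1, 0.5)
scores <- matrix(runif(1000 * 10), ncol = 10)
df <- data.frame(label, scores)

# Forking is unavailable on Windows, so fall back to one core there.
# The value 2 is an arbitrary choice for this sketch; adjust to your machine.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

auc_list <- mclapply(df[, -1], function(score) {
  as.numeric(auc(roc(df$label, score, quiet = TRUE)))
}, mc.cores = n_cores)

auc_vec <- unlist(auc_list)  # named numeric vector, one AUC per column
head(auc_vec)
```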

Answer 3

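There is a function for this in the cutpointr package. It also calculates cutpoints and other metrics, but you can discard those. By default it tries all columns except the response column as predictors. In addition, you can choose whether the direction of the ROC curve (whether larger values imply the positive class or the other way around) is determined automatically, by leaving out direction, or set manually.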

dat <- iris[1:100, ]
library(tidyverse)
library(cutpointr)
mc <- multi_cutpointr(data = dat, class = "Species", pos_class = "versicolor", 
                silent = FALSE)
mc %>% select(variable, direction, AUC)

# A tibble: 4 x 3
  variable     direction   AUC
  <chr>        <chr>     <dbl>
1 Sepal.Length >=        0.933
2 Sepal.Width  <=        0.925
3 Petal.Length >=        1.00 
4 Petal.Width  >=        1.00

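By the way, run time should not be a problem here: computing the ROC curve (even including a cutpoint) takes less than a second for one variable and one million observations with cutpointr or ROCR, so your task should run in about one or two minutes.

If memory is the limiting factor, parallelization will probably make that problem worse. If the solution above takes up too much memory, because it returns the ROC curves for all variables before those columns are dropped, you can try selecting the columns of interest immediately inside a call to map: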

# 600.000 observations for 150 variables and a binary outcome

predictors <- matrix(data = rnorm(150 * 6e5), ncol = 150)
dat <- as.data.frame(cbind(y = sample(0:1, size = 6e5, replace = TRUE), predictors))

library(cutpointr)
library(tidyverse)

vars <- colnames(dat)[colnames(dat) != "y"]
result <- map_df(vars, function(coln) {
    cutpointr_(dat, x = coln, class = "y", silent = TRUE, pos_class = 1) %>%
        select(direction, AUC) %>%
        mutate(variable = coln)
})

result

# A tibble: 150 x 3
   direction   AUC variable
   <chr>     <dbl> <chr>   
 1 >=        0.500 V2      
 2 <=        0.501 V3      
 3 >=        0.501 V4      
 4 >=        0.501 V5      
 5 <=        0.501 V6      
 6 <=        0.500 V7      
 7 <=        0.500 V8      
 8 >=        0.502 V9      
 9 >=        0.501 V10     
10 <=        0.500 V11     
# ... with 140 more rows

Answer 1
6, accepted, 2018-04-16 21:25:32

Answer 2
4, 2018-04-16 07:04:19

Answer 3
3, 2018-04-16 19:28:09
