dplyr mutate based on columns in a vector
Suppose I have a data frame that looks like this:
R1 R2 R3 ... R99 R100
-1 -1 2 ... 3 57
45 -1 -1 ... -1 37
I would like to create a new column that implements the following logic: if all values in the columns specified by mycols are equal to -1, the new column is TRUE, otherwise FALSE. So, if I set mycols <- c("R2", "R3", "R99"), the result would be
somefeature
FALSE
TRUE
On the other hand, if I set mycols <- c("R1", "R2"), the result would be
somefeature
TRUE
FALSE
How can I do this for a general mycols? I would prefer a solution that uses dplyr. I would also like to keep all columns after the operation.
Update: To decide which solution to accept, I compared the performance of all the approaches:
library(tidyverse)
library(purrr)
library(microbenchmark)
set.seed(42)
n <- 1e4
p <- 100
x <- runif(n*p); x[x < 0.8] <- -1
col_no <- paste0("R", rep(seq(1, p), n))
id <- rep(1:n, each = p)
df <- data.frame(id, x, col_no)
df <- df %>% spread(col_no, x)
foo <- function(df, mycols) {
  bind_cols(df, somefeature = df %>%
    select(mycols) %>%
    rowwise() %>%
    do( (.) %>% as.data.frame %>%
      mutate(temp = all(. == -1))) %>%
    pull(temp))
}
bar <- function(df, mycols) {
  df$somefeature = rowSums(df[mycols] != -1) == 0
  df
}
baz <- function(df, mycols) {
  df %>%
    mutate(somefeature = map(.[mycols], `==`, -1) %>%
      reduce(`+`) %>%
      {. == length(mycols) })
}
mycols <- paste0("R", c(1:50))
res1 <- foo(df, mycols) # Takes roughly a minute on my machine
res2 <- bar(df, mycols)
res3 <- baz(df, mycols)
# Verify all methods give the same solution
stopifnot(ncol(res1) == ncol(res2))
stopifnot(ncol(res1) == ncol(res3))
stopifnot(all(res1$somefeature == res2$somefeature))
stopifnot(all(res1$somefeature == res3$somefeature))
# Time the methods (not foo, as it is much slower than the other two)
microbenchmark(bar(df, mycols), baz(df, mycols))
Unit: milliseconds
expr min lq mean median uq max neval
bar(df, mycols) 3.926076 5.534273 6.782348 6.468424 7.019863 30.70699 100
baz(df, mycols) 8.289160 9.598482 11.726803 10.208659 10.909052 72.72334 100
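The base R solution is the fastest. However, since I did specify that I wanted to use the tidyverse, I decided to accept the answer that provides the fastest tidyverse-based solution.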
A fast base R solution using rowSums
mycols <- c("R2", "R3", "R99")
rowSums(df[mycols] != -1) == 0
#[1] FALSE TRUE
This can also be written as
rowSums(df[mycols] == -1) == length(mycols)
#[1] FALSE TRUE
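To see why the comparison against zero works, it may help to look at the intermediate values. The sketch below rebuilds the two-row example from the question (the names df_demo, not_minus_one and n_other are introduced here purely for illustration) and attaches the result as a new column, so that all original columns are kept.

```r
# Two-row example from the question (df_demo is an illustrative name)
df_demo <- data.frame(R1 = c(-1, 45), R2 = c(-1, -1), R3 = c(2, -1),
                      R99 = c(3, -1), R100 = c(57, 37))
mycols <- c("R2", "R3", "R99")

# Logical matrix: TRUE wherever a selected value differs from -1
not_minus_one <- df_demo[mycols] != -1

# Number of values per row that are not -1; zero means every value was -1
n_other <- rowSums(not_minus_one)

# Attach the flag as a new column, leaving the original columns untouched
df_demo$somefeature <- n_other == 0
print(df_demo)
```

Note that a missing value in any of the selected columns would make that row's result NA rather than FALSE, so data containing NA may need extra handling.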
However, if you prefer dplyr, one way using rowwise and do would be
library(dplyr)
bind_cols(df, somefeature = df %>%
  select(mycols) %>%
  rowwise() %>%
  do( (.) %>% as.data.frame %>%
    mutate(temp = all(. == -1))) %>%
  pull(temp))
# R1 R2 R3 R99 R100 somefeature
#1 -1 21 2 3 57 FALSE
#2 45 -1 -1 -1 37 TRUE
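As a possible alternative on newer dplyr versions (1.0.0 or later, where do() is superseded), the same row-wise idea can be sketched with c_across(). This is not the code from the answer above, only an illustration under that version assumption; df_demo and res_rowwise are hypothetical names.

```r
library(dplyr)

# Two-row example from the question (df_demo is an illustrative name)
df_demo <- data.frame(R1 = c(-1, 45), R2 = c(-1, -1), R3 = c(2, -1),
                      R99 = c(3, -1), R100 = c(57, 37))
mycols <- c("R2", "R3", "R99")

res_rowwise <- df_demo %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into a vector
  mutate(somefeature = all(c_across(all_of(mycols)) == -1)) %>%
  ungroup()

print(res_rowwise)
```

Like the do() version, this evaluates the test once per row, so it is likely to be noticeably slower than the vectorised approaches on large data.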
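Here is one option with tidyverse. Create a function so it can be reused. Use map (from purrr) to loop over the subset of columns specified in 'nameVec', creating a list of logical vectors, reduce it to a single vector by summing, and check whether the result equals the length of 'nameVec'.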
library(tidyverse)
mycols <- c("R2", "R3", "R99")
f1 <- function(dat, nameVec){
  dat %>%
    mutate(somefeature = map(.[nameVec], `==`, -1) %>%
      reduce(`+`) %>%
      {. == length(nameVec) })
}
f1(df1, mycols)
# R1 R2 R3 R99 R100 somefeature
#1 -1 -1 2 3 57 FALSE
#2 45 -1 -1 -1 37 TRUE
mycols <- c("R1", "R2")
f1(df1, mycols)
# R1 R2 R3 R99 R100 somefeature
#1 -1 -1 2 3 57 TRUE
#2 45 -1 -1 -1 37 FALSE
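The map and reduce steps inside f1 can be unpacked as follows. This is only a sketch of the intermediate values on the two-row example; the names df_demo, flags and counts are introduced here for illustration and do not appear in the answer.

```r
library(purrr)

# Two-row example from the question (df_demo is an illustrative name)
df_demo <- data.frame(R1 = c(-1, 45), R2 = c(-1, -1), R3 = c(2, -1),
                      R99 = c(3, -1), R100 = c(57, 37))
mycols <- c("R2", "R3", "R99")

# Step 1: one logical vector per selected column (TRUE where the value is -1)
flags <- map(df_demo[mycols], `==`, -1)

# Step 2: add the logical vectors element-wise, giving the number of
# selected columns equal to -1 in each row
counts <- reduce(flags, `+`)

# Step 3: a row qualifies only if every selected column matched
result <- counts == length(mycols)
print(result)
```

Because each step operates on whole columns, the work is vectorised across rows, which is consistent with this approach being much faster than the row-wise one in the benchmark above.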
Data used above:
df1 <- structure(list(R1 = c(-1L, 45L), R2 = c(-1L, -1L), R3 = c(2L, -1L),
  R99 = c(3L, -1L), R100 = c(57L, 37L)), class = "data.frame",
  row.names = c(NA, -2L))
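For readers on dplyr 1.0.4 or later, the same test can also be written with if_all(), which applies a predicate to a selection of columns and combines the results with a logical AND. The sketch below is an addition rather than part of the original answers, and the function name flag_all_minus_one is hypothetical.

```r
library(dplyr)

# Two-row example from the question (df_demo is an illustrative name)
df_demo <- data.frame(R1 = c(-1, 45), R2 = c(-1, -1), R3 = c(2, -1),
                      R99 = c(3, -1), R100 = c(57, 37))

# Hypothetical helper: TRUE for rows where every column in cols equals -1
flag_all_minus_one <- function(dat, cols) {
  dat %>%
    mutate(somefeature = if_all(all_of(cols), ~ .x == -1))
}

res_a <- flag_all_minus_one(df_demo, c("R2", "R3", "R99"))
res_b <- flag_all_minus_one(df_demo, c("R1", "R2"))
print(res_a)
print(res_b)
```

All original columns are kept, and all_of() makes it explicit that cols is an external character vector rather than a column of the data frame.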