R: conditional mutate on a huge data.frame

Question

I have a very large dataset (>1000 obs. of >15000 variables), and I want to replace all values >1 with 1 while leaving the rest unchanged.

Example data:

data <- data.frame(a = 1:10, b = -1:-10, c = letters[1:10])

    a   b c
1   1  -1 a
2   2  -2 b
3   3  -3 c
4   4  -4 d
5   5  -5 e
6   6  -6 f
7   7  -7 g
8   8  -8 h
9   9  -9 i
10 10 -10 j

Here is my dplyr approach:

data %>% mutate_if(is.numeric,
                   funs(
                     case_when(
                       . >= 1 ~ 1,
                       TRUE ~ as.double(.))
                   )
)

This takes a while on the original data. Any idea how to speed it up? Maybe data.table?

Answer 1

This data.table solution seems to work, although in fairness it gives a warning():

library(data.table)
library(purrr)
num_cols <- colnames(data)[map_lgl(data, is.numeric)] # select only the numerics 

data[, (num_cols):= lapply(.SD, function(x) {
                                    x[x>1] = 1
                                    x}),
     .SDcols=num_cols
     ]
data
#     a aa   b c
#  1: 1  1  -1 a
#  2: 1  1  -2 b
#  3: 1  1  -3 c
#  4: 1  1  -4 d
#  5: 1  1  -5 e
#  6: 1  1  -6 f
#  7: 1  1  -7 g
#  8: 1  1  -8 h
#  9: 1  1  -9 i
# 10: 1  1 -10 j

Warning message: In `[.data.table`(data, , `:=`((num_cols), lapply(.SD, function(x) { : Supplied 2 columns to be assigned a list (length 3) of values (1 unused)
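The warning text (2 target columns, a list of length 3) is what one would expect when `lapply(.SD, ...)` runs over all three columns of the question's example data, i.e. when `.SDcols` is left out. That reading is an inference from the message, not something the answer states. A minimal sketch under that assumption, using the question's three-column data as a data.table (`dt` is an illustrative name):

```r
library(data.table)

# The question's example data, as a data.table (the name `dt` is illustrative)
dt <- data.table(a = 1:10, b = -1:-10, c = letters[1:10])

# Names of the numeric columns only
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]

# .SDcols restricts .SD to num_cols, so lapply() returns exactly one
# element per column being assigned and nothing is left unused.
dt[, (num_cols) := lapply(.SD, function(x) {
  x[x > 1] <- 1
  x
}), .SDcols = num_cols]
```

With `.SDcols` in place, the character column `c` is never passed to the function, so it is neither compared against 1 nor reassigned.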

Data used:

data <- data.table(a = 1:10, aa = 1:10, b = -1:-10, c = letters[1:10])

Benchmark:

microbenchmark::microbenchmark(
  dplyr = data %>% mutate_if(is.numeric, 
                              funs(
                                case_when(
                                  . >= 1 ~ 1,
                                  TRUE ~ as.double(.))
                              )
  ),
  datatable = data[, (num_cols):= lapply(.SD, function(x) {
    x[x>1] = 1
    x})
    ],
  times = 100
)

# Unit: microseconds
#       expr      min        lq      mean    median        uq       max neval
#      dplyr 1465.088 1644.7690 2012.3148 1775.4730 1989.1065 19992.621   100
#  datatable  372.282  399.0235  480.9405  440.0375  547.3055   831.398   100

Update: in fairness, Ronak Shah's solution is faster:

microbenchmark::microbenchmark(
  dplyr = data %>% mutate_if(is.numeric, 
                              funs(
                                case_when(
                                  . >= 1 ~ 1,
                                  TRUE ~ as.double(.))
                              )
  ),
  datatable = data[, (num_cols):= lapply(.SD, function(x) {
    x[x>1] = 1
    x})
    ],
  base = {dataframe <- as.data.frame(data)
          dataframe[dataframe > 1] <- 1},
  times = 100
)
# Unit: microseconds
#       expr      min        lq      mean   median        uq       max neval
#      dplyr 1782.384 1902.1210 2549.3977 1995.116 2099.9800 55628.570   100
#  datatable  394.817  422.7605  466.5329  441.690  512.9020   628.282   100
#       base  118.987  135.5120  160.1595  154.291  176.2255   300.469   100
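One caveat with the base variant: `dataframe > 1` is evaluated on every column. For a character column the comparison is done on strings, so text values can be overwritten as well. A minimal base R sketch that limits the replacement to numeric columns follows. The names `df` and `is_num` are illustrative, and `pmin(x, 1)` is swapped in here for the logical indexing used in the answer. It caps values at 1 and leaves everything else, including `NA`, as it was.

```r
# Example data from the question; keep `c` as character
df <- data.frame(a = 1:10, b = -1:-10, c = letters[1:10],
                 stringsAsFactors = FALSE)

# Logical vector marking the numeric columns
is_num <- vapply(df, is.numeric, logical(1))

# Cap the numeric columns at 1; non-numeric columns are not touched
df[is_num] <- lapply(df[is_num], function(x) pmin(x, 1))
```

This has not been benchmarked against the timings above, so its speed relative to the other variants is an open question.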

Answer 2

You can try:

apply(data[, which(sapply(data, is.numeric))], 2, 
      function(x) {ifelse(x > 1, 1, x)})

It omits column c, but that can easily be merged back in afterwards.
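A rough sketch of that merge step, assuming the question's example data: `apply()` returns a matrix, so it is converted back to a data.frame and bound to the non-numeric columns. The names `df`, `num`, `capped` and `result` are illustrative, and `drop = FALSE` is added so the code also works when only one column is numeric.

```r
# Example data from the question
df <- data.frame(a = 1:10, b = -1:-10, c = letters[1:10])

# Positions of the numeric columns
num <- which(sapply(df, is.numeric))

# apply() over columns returns a numeric matrix with the same column names
capped <- apply(df[, num, drop = FALSE], 2,
                function(x) ifelse(x > 1, 1, x))

# Bind the capped numeric columns back to the remaining column(s)
result <- cbind(as.data.frame(capped), df[-num])
```

Note that this puts the numeric columns first, so the column order can differ from the input when numeric and non-numeric columns are interleaved.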
