简体   繁体   中英

Alternative for sapply

I am using the following code in R :

df$max_col<- sapply(df$col, function(x) ifelse(x == "", 0, strsplit(as.character(x), "", perl = TRUE)[[1]] %>% as.numeric %>% max(na.rm = T)))

This code basically breaks a string like "123456" and turns it into numeric and returns the max value from it. Now, I have a column full of strings like these and this code runs fine till the data size is low. But when the data size is 25 million rows (which I currently am dealing with) this code becomes very slow. Is there any alternative for this code through which I can get the max value from the string stored in a new column?

An answer based on my comment above (but I've modified the code so that it actually works):

x <- c("123", "224", "221", "1912323", "445")
apply(sapply(1:9, function(p) grepl(p, x)), 1, function(k) max(which(k)))
# the above will work if 0 is never the largest  number in any cell

A more generalized version:

doit <- function(x) apply(sapply(0:9, function(p) grepl(p, x)), 1, function(k) max(which(k)))-1
x <- c("123", "224", "221", "1912323", "445", "000")
doit(x)
# [1] 3 4 2 9 5 0

This is about 3 times faster than the original code using strsplit ... but I'm sure there is room for improvement. Umm ... actually, I'm gonna try with strsplit again:

doit3 <- function(.) sapply(strsplit(.,""), max)
doit3(x)
# [1] "3" "4" "2" "9" "5" "0"

This is about 5 times faster than my previous approach. So the problem was not in sapply or strsplit but in the other components. If you need to convert it to numeric, add as.numeric to the outer layer, this won't take much extra time:

doit4 <- function(.) as.numeric(sapply(strsplit(.,""), max))
> doit4(x)
# [1] 3 4 2 9 5 0

Converting to an integer then computing the digits using %% and %/% appears to be fastest for a 25,000,000 length vector:

a <- as.character(sample(1:1e6, size = 25e6, replace = TRUE))

use_grepl <- function(x) {
  o <- integer(length(x))
  o[grep('1', x, fixed = TRUE)] <- 1L
  o[grep('2', x, fixed = TRUE)] <- 2L
  o[grep('3', x, fixed = TRUE)] <- 3L
  o[grep('4', x, fixed = TRUE)] <- 4L
  o[grep('5', x, fixed = TRUE)] <- 5L
  o[grep('6', x, fixed = TRUE)] <- 6L
  o[grep('7', x, fixed = TRUE)] <- 7L
  o[grep('8', x, fixed = TRUE)] <- 8L
  o[grep('9', x, fixed = TRUE)] <- 9L
  o
}

use_strsplit <- function(x) {
  tbl19 <- as.character(1:9)
  vapply(strsplit(x, split = "", fixed = TRUE),
         function(v) {
           max(fmatch(v, table = tbl19, nomatch = 0L))
         },
         0L)
}

use_mod <- function(xx) {

  nth_digit_of <- function (x, n) {
    {x %% 10^n} %/% 10^{n - 1L}
  }
  v <- as.integer(xx)
  most_digits <- as.integer(ceiling(log10(max(v))) + 1)
  o <- nth_digit_of(v, 1L)
  for (vj in 2:most_digits) {
    o <- pmax.int(o, nth_digit_of(v, vj)) 
  }
  as.integer(o)
}


doit4 <- function(V) as.numeric(sapply(strsplit(V, ""), max))

bench::mark(use_mod(a), use_grepl(a), doit4(a))
# A tibble: 3 x 14
  expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr total_time result memory time 
  <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list> <lis>
1 use_mod(a) 14.4s 14.4s  14.4s 14.4s    0.0693    2.61GB     3     1      14.4s <int ~ <Rpro~ <bch~
2 use_grepl~ 38.2s 38.2s  38.2s 38.2s    0.0262    1.32GB     0     1      38.2s <int ~ <Rpro~ <bch~
3 doit4(a)   56.5s 56.5s  56.5s 56.5s    0.0177    1.18GB     7     1      56.5s <dbl ~ <Rpro~ <bch~

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM