
Efficient conversion to vectors in R

Can anyone help me make this R code more efficient?

I'm trying to write a function that changes a list of strings to a vector of strings, or a list of numbers to a vector of numbers, or, in general, a list of typed elements to a vector of a certain type.

I want to be able to change lists to a particular type of vector if they have the following properties:

  1. They are homogeneously typed. Every element of the list is of type 'character', or 'complex', and so on.

  2. Each element of the list is length-one.

as_atomic <- local({

    assert_is_valid_elem <- function (elem, mode) {

        if (length(elem) != 1 || !is(elem, mode)) {
            stop("")
        }
        TRUE
    }

    function (coll, mode) {

        if (length(coll) == 0) {
            vector(mode)
        } else {
            # check that the generic vector is composed only
            # of length-one values, and each value has the correct type.

            # uses more memory than 'for', but is presumably faster.
            vapply(coll, assert_is_valid_elem, logical(1), mode = mode)

            as.vector(coll, mode = mode)
        }
    }
})

For example:

as_atomic(list(1, 2, 3), 'numeric')
as.numeric(c(1,2,3))

# this fails (mixed types)
as_atomic( list(1, 'a', 2), 'character' )
# ERROR.

# this fails (non-length one element)
as_atomic( list(1, c(2,3,4), 5), 'numeric' )
# ERROR.

# this fails (cannot convert numbers to strings)
as_atomic( list(1, 2, 3), 'character' )
# ERROR.

The above code works fine, but it is very slow, and I can't see any way to optimise it without changing the behaviour of the function. It's important that 'as_atomic' keeps behaving as it does; I can't switch to a base function that I'm familiar with (unlist, for example), since I need to throw an error for bad lists.
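To make the objection to unlist concrete, here is a minimal sketch of what it does with the two kinds of "bad" list from the examples above. The variable names are only for illustration.

```r
# unlist() never raises an error for these inputs; it coerces or flattens.

# Mixed types: the numbers are silently converted to strings.
mixed <- unlist(list(1, "a", 2))
print(mixed)

# A non-length-one element: the list is flattened, so the result
# has more entries than the input list had elements.
flat <- unlist(list(1, c(2, 3, 4), 5))
print(length(flat))
```

Neither call fails, which is why a plain unlist cannot replace as_atomic on its own.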

require(microbenchmark)

microbenchmark(
    as_atomic( as.list(1:1000), 'numeric'),
    vapply(1:1000, identity, integer(1)),
    unit = 'ns'
)

On my (fairly fast) machine the benchmark runs at about 40 Hz, so this function is almost always the rate-limiting step in my code. The vapply control benchmark runs at about 1650 Hz, which is still quite slow.

Is there any way to dramatically improve the efficiency of this operation? Any advice is appreciated.

If any clarification or edits are needed, please leave a comment below.

Edit:

Hello all,

Sorry for the very belated reply; I had exams I needed to get to before I could try to re-implement this.

Thank you all for the performance tips. I got the performance up from a terrible 40 Hz to a more acceptable 600 Hz using plain R code.

The largest speedup came from using typeof or mode instead of is; this really sped up the tight inner checking loop.

I'll probably have to bite the bullet and rewrite this in Rcpp to get it really performant, though.

There are two parts to this problem:

  1. checking that inputs are valid
  2. coercing a list to a vector

Checking valid inputs

First, I'd avoid is() because it's known to be slow. That gives:

check_valid <- function (elem, mode) {
  if (length(elem) != 1) stop("Must be length 1")
  if (mode(elem) != mode) stop("Not desired type")

  TRUE
}
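Since the check now compares against mode(), it may help to see what mode() and typeof() report for a few length-one values. This is only an illustrative sketch; the variable names are not from the original post.

```r
# mode() groups integers and doubles together as "numeric",
# while typeof() distinguishes them.
int_mode    <- mode(1L)      # "numeric"
dbl_mode    <- mode(1.5)     # "numeric"
int_type    <- typeof(1L)    # "integer"
dbl_type    <- typeof(1.5)   # "double"

# Other atomic types report the same name from both functions.
chr_mode    <- mode("a")     # "character"
cplx_mode   <- mode(1+2i)    # "complex"
lgl_mode    <- mode(TRUE)    # "logical"

print(c(int_mode, dbl_mode, int_type, dbl_type, chr_mode, cplx_mode, lgl_mode))
```

So a check based on mode() accepts both integer and double elements for "numeric", whereas one based on typeof() would have to be asked for "integer" or "double" specifically.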

Now we need to figure out whether a loop or an apply variant is faster. We'll benchmark with the worst possible case, where all inputs are valid, so every element has to be checked.

worst <- as.list(0:101)

library(microbenchmark)
options(digits = 3)
microbenchmark(
  `for` = for(i in seq_along(worst)) check_valid(worst[[i]], "numeric"),
  lapply = lapply(worst, check_valid, "numeric"),
  vapply = vapply(worst, check_valid, "numeric", FUN.VALUE = logical(1))
)

## Unit: microseconds
##    expr min  lq median  uq  max neval
##     for 278 293    301 318 1184   100
##  lapply 274 282    291 310 1041   100
##  vapply 273 284    288 298 1062   100

The three methods are basically tied. lapply() is very slightly faster, probably because of the special C tricks that it uses.

Coercing list to vector

Now let's look at a few ways of coercing a list to a vector:

change_mode <- function(x, mode) {
  mode(x) <- mode
  x
}

microbenchmark(
  change_mode = change_mode(worst, "numeric"),
  unlist = unlist(worst),
  as.vector = as.vector(worst, "numeric")
)

## Unit: microseconds
##         expr   min    lq median   uq    max neval
##  change_mode 19.13 20.83  22.36 23.9 167.51   100
##       unlist  2.42  2.75   3.11  3.3  22.58   100
##    as.vector  1.79  2.13   2.37  2.6   8.05   100

So it looks like you're already using the fastest method, and the total cost is dominated by the check.

Alternative approach

Another idea is that we might be able to get a little faster by looping over the vector once, instead of once to check and once to coerce:

as_atomic_for <- function (x, mode) {
  out <- vector(mode, length(x))

  for (i in seq_along(x)) {
    check_valid(x[[i]], mode)
    out[i] <- x[[i]]
  }

  out
}
microbenchmark(
  as_atomic_for(worst, "numeric")
)

## Unit: microseconds
##                             expr min  lq median  uq  max neval
##  as_atomic_for(worst, "numeric") 497 524    557 685 1279   100

That's definitely worse.

All in all, I think this suggests that if you want to make this function faster, you should try vectorising the check function in Rcpp.
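Short of Rcpp, part of the check can also be vectorised in plain R. The following is only a sketch under assumptions, not the approach benchmarked above: it assumes lengths() is available (base R 3.2.0 or later), and as_atomic_vec is a hypothetical name.

```r
# Hypothetical plain-R variant: check all lengths and all modes in
# two vectorised passes instead of calling a checker per element.
as_atomic_vec <- function(x, mode) {
  if (length(x) == 0) return(vector(mode))

  # lengths() returns the length of every element in one call.
  if (any(lengths(x) != 1L)) stop("Must be length 1")

  # base::mode is spelled out because the argument is also named 'mode'.
  if (any(vapply(x, base::mode, character(1)) != mode)) stop("Not desired type")

  as.vector(x, mode = mode)
}

ok <- as_atomic_vec(list(1, 2, 3), "numeric")
print(ok)
```

Whether this is actually faster than the per-element check would need to be measured with microbenchmark on the data in question.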

Try:

as_atomic_2 <- function(x, mode) {
  if (length(unique(vapply(x, typeof, ""))) != 1L) stop("mixed types")
  as.vector(x, mode)
}
as_atomic_2(list(1, 2, 3), 'numeric')
# [1] 1 2 3
as_atomic_2(list(1, 'a', 2), 'character')
# Error in as_atomic_2(list(1, "a", 2), "character") : mixed types
as_atomic_2(list(1, c(2,3,4), 5), 'numeric' )
# Error in as.vector(x, mode) : 
#   (list) object cannot be coerced to type 'double'

microbenchmark(
  as_atomic( as.list(1:1000), 'numeric'),
  as_atomic_2(as.list(1:1000), 'numeric'),
  vapply(1:1000, identity, integer(1)),
  unit = 'ns'
)    
# Unit: nanoseconds
#                                     expr      min       lq     median 
#    as_atomic(as.list(1:1000), "numeric") 23571781 24059432 24747115.5 
#  as_atomic_2(as.list(1:1000), "numeric")  1008945  1038749  1062153.5 
#     vapply(1:1000, identity, integer(1))   719317   762286   778376.5 
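One behavioural difference from the original as_atomic is worth checking before adopting this version: it tests that the elements share a type, but not that the shared type matches the requested mode. A minimal sketch, using a copy of the function under the hypothetical name same_type_only:

```r
# Hypothetical copy of the homogeneity-only check, for illustration.
same_type_only <- function(x, mode) {
  if (length(unique(vapply(x, typeof, ""))) != 1L) stop("mixed types")
  as.vector(x, mode)
}

# All elements are doubles, so the check passes and the numbers are
# coerced to strings rather than raising an error.
res <- same_type_only(list(1, 2, 3), "character")
print(res)
```

The question's third example expects this call to fail, so an extra comparison against the requested mode would be needed to reproduce the original behaviour exactly.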

Defining your own function to do the type checking seems to be the bottleneck. Using one of the builtin functions gives a large speedup. However, the call signature changes somewhat (although it might be possible to change that). The examples below are the fastest versions I could come up with:

As mentioned, using is.numeric, is.character and so on gives the largest speedup:

as_atomic2 <- function(l, check_type) {
  if (!all(vapply(l, check_type, logical(1)))) stop("")
  r <- unlist(l)
  if (length(r) != length(l)) stop("")
  r
} 

The following is the fastest I could come up with using the original interface:

as_atomic3 <- function(l, type) {
  if (!all(vapply(l, mode, character(length(type))) == type)) stop("")
  r <- unlist(l)
  if (length(r) != length(l)) stop("")
  r
}
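One possible way to keep the original string interface while still using the builtin predicates is a lookup table from type names to is.* functions. This is a sketch under assumptions: type_checks and as_atomic4 are hypothetical names, only four types are covered, and it has not been benchmarked here.

```r
# Hypothetical lookup from type name to builtin predicate.
type_checks <- list(
  numeric   = is.numeric,
  character = is.character,
  logical   = is.logical,
  complex   = is.complex
)

as_atomic4 <- function(l, type) {
  check_type <- type_checks[[type]]
  if (is.null(check_type)) stop("unsupported type: ", type)

  # Match the original function for empty input.
  if (length(l) == 0) return(vector(type))

  if (!all(vapply(l, check_type, logical(1)))) stop("wrong element type")
  r <- unlist(l)
  # unlist() flattens longer elements, so a length change means a bad element.
  if (length(r) != length(l)) stop("elements must be length one")
  r
}

out <- as_atomic4(list(1, 2, 3), "numeric")
print(out)
```

Note that, like the unlist-based versions above, this returns an integer vector when given integer elements and "numeric", because is.numeric accepts both integers and doubles.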

Benchmarking against the original:

res <- microbenchmark(
    as_atomic( as.list(1:1000), 'numeric'),
    as_atomic2( as.list(1:1000), is.numeric),
    as_atomic3( as.list(1:1000), 'numeric'),
    unit = 'ns'
)
#                                    expr      min         lq     median         uq      max neval
#   as_atomic(as.list(1:1000), "numeric") 13566275 14399729.0 14793812.0 15093380.5 34037349   100
# as_atomic2(as.list(1:1000), is.numeric)   314328   325977.0   346353.5   369852.5   896991   100
#  as_atomic3(as.list(1:1000), "numeric")   856423   899942.5   967705.5  1023238.0  1598593   100
