
Apply a function to each row of a matrix without using lapply function in R

I have an input data frame with multiple rows. For each row, I want to apply a function. The input data frame has 1,000,000+ rows. How can I speed up the part using lapply? I would like to avoid the apply family of functions, as in "Efficient way to apply function to each row of data frame and return list of data frames", because these methods seem to be slow in my case.

Here is a reproducible example with a simple function:

library(tictoc)   # enable use of tic() and toc() to record time taken for test to compute

func <- function(coord, a, b, c){

  X1 <- as.vector(coord[1])
  Y1 <- as.vector(coord[2])
  X2 <- as.vector(coord[3])
  Y2 <- as.vector(coord[4])

  if(c == 0) {

    res1 <- mean(c((X1 - a) : (X1 - 1), (Y1 + 1) : (Y1 + 40)))
    res2 <- mean(c((X2 - a) : (X2 - 1), (Y2 + 1) : (Y2 + 40)))
    res <- matrix(c(res1, res2), ncol=2, nrow=1)

  } else {

    res1 <- mean(c((X1 - a) : (X1 - 1), (Y1 + 1) : (Y1 + 40)))*b
    res2 <- mean(c((X2 - a) : (X2 - 1), (Y2 + 1) : (Y2 + 40)))*b
    res <- matrix(c(res1, res2), ncol=2, nrow=1)

  }

  return(res)
}

## Apply the function
set.seed(1)
n = 10000000
tab <- as.matrix(data.frame(x1 = sample(1:100, n, replace = T), y1 = sample(1:100, n, replace = T), x2 = sample(1:100, n, replace = T), y2 = sample(1:100, n, replace = T)))


tic("test 1")
test <- do.call("rbind", lapply(split(tab, 1:nrow(tab)),
                                function(x) func(coord = x,
                                                 a = 40,
                                                 b = 5,
                                                 c = 1)))
toc()



 ## test 1: 453.76 sec elapsed

This seems like a good opportunity to refactor this into a vectorized calculation, which R can evaluate much faster. (TL;DR: this makes it about 1000x faster.)

It looks like the task here is to take a weighted average of two ranges of integers, where the endpoints of the ranges vary by row (based on X1, X2, Y1, and Y2), but the sequences are the same length in each row. This helps, because it means we can use algebra to simplify the calculation.

For the simple case where a = 40, the first sequence runs from x1-40 to x1-1, and the second from y1+1 to y1+40. The mean is the sum of these two sequences divided by 80. The sum is 40*x1 + 40*y1 + sum of (-40:-1) + sum of (1:40), and those last two terms cancel out. So you can simply output the average of each pair of columns, multiplied by b.

library(dplyr)
b <- 5
# tab is the matrix built in the question; as_tibble() keeps its column names
quick_test <- tab %>%
  as_tibble() %>%
  mutate(V1 = (x1 + y1)/2 * b,
         V2 = (x2 + y2)/2 * b)

Using n = 1E6 (10% of the OP's size), the OP's function takes 73 seconds. The code above takes 0.08 seconds and produces the same output.
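As a quick sanity check of that cancellation on a single row, here is a minimal base R sketch. The coordinate values are arbitrary and chosen only for illustration; they are not taken from the question's data.

```r
# Arbitrary illustrative values (assumptions, not from the question's data)
x1 <- 17
y1 <- 63
a  <- 40

# Mean computed the long way, as in the question's func()
direct <- mean(c((x1 - a):(x1 - 1), (y1 + 1):(y1 + 40)))

# Shortcut: with a = 40 the offsets cancel, leaving the plain average
shortcut <- (x1 + y1) / 2

all.equal(direct, shortcut)
```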

For the cases where a != 40, it takes a little more algebra. V1 here ends up as a weighted average, where we're adding up the sequence (x1-a):(x1-1) and the sequence (y1+1):(y1+40), all divided by a+40 (since there are a terms in the x1 sequence and 40 terms in the y1 sequence). We don't actually need to add up these sequences; we can convert the sum to a shorter calculation using algebra: https://en.wikipedia.org/wiki/Arithmetic_progression

sum of (x1-a):(x1-1) = x1*a + sum of (-a:-1) = x1*a + a*(-a + -1)/2 = x1*a - (a*a + a)/2

sum of (y1+1):(y1+40) = 40*y1 + sum of (1:40) = 40*y1 + 40*41/2 = 40*y1 + 820
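The closed form can be checked against the explicit sum for a few values. This is only a sketch: `closed_sum` is a hypothetical helper name introduced for illustration, and the test values are arbitrary.

```r
# Closed form for the sum of (x1 - a):(x1 - 1), from the arithmetic-progression formula
closed_sum <- function(x1, a) x1 * a - (a * a + a) / 2

# Compare against the explicit sum for a few arbitrary (x1, a) pairs
x1_vals <- c(5, 42, 100)
a_vals  <- c(1, 41, 50)
for (x1 in x1_vals) {
  for (a in a_vals) {
    stopifnot(sum((x1 - a):(x1 - 1)) == closed_sum(x1, a))
  }
}

# The y sequence always has 40 terms, so its constant offset is sum(1:40)
sum(1:40)  # 820
```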

That all means we can fully replicate the code for any positive a using:

library(dplyr)

a <- 50
b <- 5

tictoc::tic("test 2b")
# Closed-form weighted mean: a terms from the x sequence, 40 terms from the y sequence
quick_test2 <- tab %>%
  as_tibble() %>%
  mutate(V1 = (a*x1 - (a*a + a)/2 + 40*y1 + 820)/(a+40)*b,
         V2 = (a*x2 - (a*a + a)/2 + 40*y2 + 820)/(a+40)*b)
tictoc::toc()

This is about 1000x faster. With n = 1E6, a = 41, b = 5, c = 1, the OP's solution took 154 seconds on my 2012 laptop, while quick_test2 above took 0.23 seconds and gave identical results.

(Small addendum: you could add a test that sets b = 1 if c == 0, and then you've taken care of the if-else condition.)

Based on Jon Spring's answer, we can do the same with base R:

test2 <- function(d, a, b, c) {
  if (c == 0) b <- 1
  X <- d[, c('x1', 'x2')]
  Y <- d[, c('y1', 'y2')]
  (a*X - (a*a + a)/2  + 40*Y + 820)/(a+40)*b
}

res2 <- test2(tab, 40, 5, 1)
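To gain some confidence that the vectorized version agrees with the row-wise one, the two can be compared on a small sample. The sketch below restates both functions so it runs on its own; `func_ref` is a hypothetical compact restatement of the question's function, and `small` is an arbitrary 20-row matrix. `apply` is used here only for checking, not as a recommended approach.

```r
# Compact restatement of the question's row-wise function (same arithmetic)
func_ref <- function(coord, a, b, c) {
  scale <- if (c == 0) 1 else b
  res1 <- mean(c((coord[1] - a):(coord[1] - 1), (coord[2] + 1):(coord[2] + 40)))
  res2 <- mean(c((coord[3] - a):(coord[3] - 1), (coord[4] + 1):(coord[4] + 40)))
  c(res1, res2) * scale
}

# Vectorized base R version from the answer above
test2 <- function(d, a, b, c) {
  if (c == 0) b <- 1
  X <- d[, c('x1', 'x2')]
  Y <- d[, c('y1', 'y2')]
  (a*X - (a*a + a)/2 + 40*Y + 820)/(a+40)*b
}

# Small arbitrary sample with the same column layout as tab
set.seed(1)
small <- cbind(x1 = sample(1:100, 20, replace = TRUE),
               y1 = sample(1:100, 20, replace = TRUE),
               x2 = sample(1:100, 20, replace = TRUE),
               y2 = sample(1:100, 20, replace = TRUE))

# Row-wise reference result (one row per input row, two columns)
slow <- t(apply(small, 1, func_ref, a = 41, b = 5, c = 1))
fast <- test2(small, 41, 5, 1)

all.equal(unname(slow), unname(fast))
```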

Looks like there are already some very fast options. Another, slower option would be a standard for loop.

This is much slower than theirs, but still 3 times faster than the lapply version.

n = 1e6

tic("test 2")
test <- vector("list", nrow(tab))
for (i in seq_len(nrow(tab))) {
  test[[i]] <- func(coord = tab[i, ], a = 40, b = 5, c = 1)
}
testout <- do.call(rbind, test)
toc()

> test 2: 3.85 sec elapsed

I suggest looking up the tidyverse, in this case specifically dplyr (a tidyverse sub-package). The tidyverse is a huge collection of useful and "tidy" (aka, FAST) operations. Once you go tidy, you never go back.

First, just some general math advice. Taking the average of a sequence can be done without actually generating the entire sequence. For a sequence of consecutive integers you just need the start and end, as the average of the first and last number is the same as the average of the entire sequence. If your real data is a vector of non-sequential numbers, let me know. The following three lines of code check that the mean of the first and last number is the same as the mean of the full sequence:

seqstart <- sample(1:50, 1, replace = T)
seqend <- sample(51:100, 1, replace = T)
mean(c(seqstart, seqend)) == mean(seqstart:seqend)

If you don't believe me, paste those 3 lines into your console until you find a FALSE value, or until you believe me. :)
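One caveat worth checking: the endpoint shortcut holds for a single run of consecutive integers. When two runs are concatenated, as in the question's function, averaging the four endpoints matches the true mean only if both runs have the same length (here, a = 40). A minimal sketch, with hypothetical helper names and arbitrary illustrative values:

```r
x1 <- 30
y1 <- 70   # arbitrary illustrative values

# Average of the four endpoints of the two runs
four_endpoint_mean <- function(x1, y1, a) {
  ((x1 - a) + (x1 - 1) + (y1 + 1) + (y1 + 40)) / 4
}

# Mean of the full concatenated sequence, as in the question's func()
true_mean <- function(x1, y1, a) {
  mean(c((x1 - a):(x1 - 1), (y1 + 1):(y1 + 40)))
}

# Equal lengths (a = 40): the two agree
four_endpoint_mean(x1, y1, 40) == true_mean(x1, y1, 40)   # TRUE

# Unequal lengths (a = 10): they differ
four_endpoint_mean(x1, y1, 10) == true_mean(x1, y1, 10)   # FALSE
```

For a != 40, the weighted form (a terms from the x run, 40 terms from the y run) is the one that reproduces the original result.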

library(tidyverse)
set.seed(1)
n <- 10000000
tab <- data.frame(x1 = sample(1:100, n, replace = TRUE),
                  y1 = sample(1:100, n, replace = TRUE),
                  x2 = sample(1:100, n, replace = TRUE),
                  y2 = sample(1:100, n, replace = TRUE))

Notice I am not using a matrix yet. You can recreate your matrix later. If you are starting with a matrix for some reason, honestly I would just change it to a normal table for this so I can use tidy operations more easily. Maybe a guru can teach us how to use tidyverse operations on matrices; I don't know how. Solution:

tic("test 1")
a <- 40
b <- 5
test <- tab %>%
  mutate(c = 1) %>%
  mutate(res1 = if_else(c == 1,
                        ((x1 - a) + (x1 - 1) + (y1 + 1) + (y1 + 40)) / 4 * b,
                        ((x1 - a) + (x1 - 1) + (y1 + 1) + (y1 + 40)) / 4),
         res2 = if_else(c == 1,
                        ((x2 - a) + (x2 - 1) + (y2 + 1) + (y2 + 40)) / 4 * b,
                        ((x2 - a) + (x2 - 1) + (y2 + 1) + (y2 + 40)) / 4)) %>%
  select(res1, res2)
toc()

test 1: 8.91 sec elapsed

Fast enough for me.

Please note I used mutate to create a new column called "c" and set it to 1. This is because dplyr doesn't like it if you use if_else statements that have logical checks against an environment variable (and if that variable is always 1, why would we code this in the first place?). Thus, I am assuming that you are planning to use a "c" that can sometimes be 1 and sometimes be 0, and I am proposing here that you should have that data in a column that we can reference.
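If c really is a single value for the whole table rather than a per-row column, one alternative is to resolve the flag once before the vectorized arithmetic, in the spirit of the addendum in the first answer. A minimal base R sketch; `c_flag`, `scale_factor`, and the two-row `tab_small` are hypothetical names and values used only for illustration:

```r
a <- 40
b <- 5
c_flag <- 0   # scalar flag, standing in for the question's c argument

# Resolve the scalar condition once, outside the vectorized arithmetic
scale_factor <- if (c_flag == 0) 1 else b

# Tiny illustrative table (not the question's data)
tab_small <- data.frame(x1 = c(10, 20), y1 = c(30, 40))

# Closed-form weighted mean, scaled by the resolved factor
tab_small$res1 <- (a * tab_small$x1 - (a * a + a) / 2 + 40 * tab_small$y1 + 820) /
  (a + 40) * scale_factor
tab_small
```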

@Jon Spring has provided a really good answer above.

However, I am suggesting a method that uses {data.table}.

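library(data.table)
library(tictoc)

test2 <- data.table(copy(tab))
tic("test2")
a <- 40
b <- 5
c <- 1   # c is nonzero here, so the results are scaled by b
# := adds each output column by reference, without copying the table
test2[, Output1 := (x1*a - 0.5*(a + a^2) + 40 * y1 + 820)/ (a + 40) * b]
test2[, Output2 := (x2*a - 0.5*(a + a^2) + 40 * y2 + 820)/ (a + 40) * b]
toc()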

This method takes around 0.4 to 3.28 seconds on my laptop when n = 1e7.

For n = 1e6, the method you posted in the question takes around 138 seconds, while the method I used takes about 0.3 seconds.
