
Extract column from data.frame faster than from matrix - why?

I'm running a simulation where I need to repeatedly extract one column from a matrix and check each of its values against some condition (e.g. < 10). However, doing so with a matrix is 3 times slower than doing the same thing with a data.frame. Why is this the case?

I'd like to use matrices to store the simulation data because they are faster for some other operations (e.g. updating columns by adding/subtracting values). How can I extract columns / subset a matrix in a faster way?

Extract column from data.frame vs matrix:

df <- data.frame(a = 1:1e4)
m <- as.matrix(df)

library(microbenchmark)
microbenchmark(
  df$a, 
  m[ , "a"])

# Results; Unit: microseconds
#      expr    min      lq     mean median      uq     max neval cld
#      df$a  5.463  5.8315  8.03997  6.612  8.0275  57.637   100   a 
# m[ , "a"] 64.699 66.6265 72.43631 73.759 75.5595 117.922   100   b

Extract single value from data.frame vs matrix:

microbenchmark(
  df[1, 1],
  df$a[1],
  m[1, 1], 
  m[ , "a"][1])  

# Results; Unit: nanoseconds
#         expr   min      lq     mean  median      uq    max neval  cld
#     df[1, 1]  8248  8753.0 10198.56  9818.5 10689.5  48159   100    c 
#      df$a[1]  4072  4416.0  5247.67  5057.5  5754.5  17993   100    b  
#      m[1, 1]   517   708.5   828.04   810.0   920.5   2732   100    a   
# m[ , "a"][1] 45745 47884.0 51861.90 49100.5 54831.5 105323   100    d

I expected the matrix column extraction to be faster, but it was slower. However, extracting a single value from a matrix (i.e. m[1, 1]) was faster than both of the ways of doing so with a data.frame. I'm lost as to why this is.

Extract row vs column, data.frame vs matrix:

The above is only true for selecting columns. When selecting rows, matrices are much faster than data.frames. Still don't know why.

microbenchmark(
  df[1, ],
  m[1, ],
  df[ , 1],
  m[ , 1])

# Result: Unit: nanoseconds
#     expr   min      lq     mean  median      uq   max neval  cld
#  df[1, ] 16359 17243.5 18766.93 17860.5 19849.5 42973   100    c 
#   m[1, ]   718   999.5  1175.95  1181.0  1327.0  3595   100    a   
# df[ , 1]  7664  8687.5  9888.57  9301.0 10535.5 42312   100    b  
#  m[ , 1] 64874 66218.5 72074.93 73717.5 74084.5 97827   100    d

data.frame

Consider the builtin data frame BOD. Data frames are stored as a list of columns, and the inspect output shown below gives the address of each of the two columns of BOD. We then assign its second column to BOD2. Note that the address of BOD2 is the same memory location as the second column shown in the inspect output for BOD. That is, all R did to create BOD2 was have it point to memory within BOD. There was no data movement at all. Another way to see this is to compare the size of BOD, BOD2 and both together: both together take up the same amount of memory as BOD alone, so there must have been no copying. (Continued after code.)

library(pryr)

BOD2 <- BOD[[2]]
inspect(BOD)
## <VECSXP 0x507c278>
##   <REALSXP 0x4f81f48>
##   <REALSXP 0x4f81ed8>  <--- compare this address to address shown below
## ...snip...

BOD2 <- BOD[,2]
address(BOD2)
## [1] "0x4f81ed8"

object_size(BOD)
## 1.18 kB
object_size(BOD2)
## 96 B
object_size(BOD, BOD2)    # same as object_size(BOD) above
## 1.18 kB
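The list-of-columns structure can also be checked without pryr. The following is only a minimal base-R sketch; the names col_a, col_b and col_c are made up for illustration, and it shows the structure of the data frame rather than the memory addresses.

```r
# A data frame is a list of column vectors plus a few attributes.
is.list(BOD)    # a data frame passes the list test
length(BOD)     # number of columns, not number of rows
names(BOD)      # column names

# Three ways to pull out the second column; each returns the column vector.
col_a <- BOD[[2]]
col_b <- BOD$demand
col_c <- BOD[, 2]
identical(col_a, col_b) && identical(col_a, col_c)
```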

matrix

Matrices are stored as one long vector with dimensions rather than as a list of columns, so the strategy for extracting a column is different. If we look at the memory used by a matrix m, an extracted column m2 and both together, we see below that both together use the sum of the memory of the individual objects, showing that the data was copied.

set.seed(123)

n <- 10000L
m <- matrix(rnorm(2*n), n, 2)
m2 <- m[, 2]

object_size(m)
## 160 kB
object_size(m2)
## 80 kB
object_size(m, m2) 
## 240 kB  <-- unlike for data.frames this equals sum of above
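To make the "one long vector with dimensions" layout concrete, here is a small base-R sketch on a deliberately tiny matrix. The names flat, col2, i and j are illustrative assumptions, not part of the code above.

```r
set.seed(123)
n <- 5L
m <- matrix(rnorm(2 * n), n, 2)

# The matrix is a single vector of length n * 2 carrying a dim attribute.
flat <- as.vector(m)
length(flat)

# Column 2 occupies positions (n + 1) to (2 * n) of that vector, so
# m[, 2] has to build a new vector of length n from those elements.
col2 <- m[, 2]
identical(col2, flat[(n + 1):(2 * n)])

# A single element is one lookup at position (j - 1) * n + i.
i <- 3L
j <- 2L
identical(m[i, j], flat[(j - 1L) * n + i])
```

This is consistent with the timings in the question: m[1, 1] touches one element, whereas m[ , "a"] allocates and fills a fresh vector of 10,000 elements on every call.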

what to do

If your program relies on column extraction only up to a certain point, you could use a data frame for that portion, then do a one-time conversion to a matrix and work with the matrix for the rest.
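A rough sketch of that two-phase approach follows. The data, the object names state_df, state_m and below_10, and the point at which the conversion happens are all hypothetical; whether it pays off depends on how the extraction and update steps are interleaved in the real simulation.

```r
# Hypothetical simulation state: two columns of 10,000 values.
state_df <- data.frame(a = runif(1e4, 0, 20), b = runif(1e4, 0, 20))

# Phase 1: repeated column extraction and condition checks on the data frame.
below_10 <- state_df$a < 10

# One-time conversion once column updates start to dominate.
state_m <- as.matrix(state_df)

# Phase 2: arithmetic updates on the matrix.
state_m[, "a"] <- state_m[, "a"] + 1
```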

I suppose it comes down to how R lays out these data structures in memory. A matrix in R is a 2-d array, which is stored the same way as a 1-d array. A variable points directly to that memory, so extracting a single value is very fast. Extracting a column from a matrix takes some computation: R has to allocate new memory and copy the values into it. A data frame, on the other hand, is actually a list of columns, so returning a column is faster. That's my guess; I hope someone can confirm it.
