Extract column from data.frame faster than from matrix - why?
I'm running a simulation where I need to repeatedly extract one column from a matrix and check each of its values against some condition (e.g. < 10). However, doing so with a matrix is 3 times slower than doing the same thing with a data.frame. Why is this the case?
I'd like to use matrices to store the simulation data because they are faster for some other operations (e.g. updating columns by adding/subtracting values). How can I extract columns / subset a matrix in a faster way?
df <- data.frame(a = 1:1e4)
m <- as.matrix(df)
library(microbenchmark)
microbenchmark(
df$a,
m[ , "a"])
# Results; Unit: microseconds
#      expr    min      lq     mean median      uq     max neval cld
#      df$a  5.463  5.8315  8.03997  6.612  8.0275  57.637   100   a
# m[ , "a"] 64.699 66.6265 72.43631 73.759 75.5595 117.922   100   b
microbenchmark(
df[1, 1],
df$a[1],
m[1, 1],
m[ , "a"][1])
# Results; Unit: nanoseconds
#         expr   min      lq     mean  median      uq    max neval cld
#     df[1, 1]  8248  8753.0 10198.56  9818.5 10689.5  48159   100   c
#      df$a[1]  4072  4416.0  5247.67  5057.5  5754.5  17993   100   b
#      m[1, 1]   517   708.5   828.04   810.0   920.5   2732   100   a
# m[ , "a"][1] 45745 47884.0 51861.90 49100.5 54831.5 105323   100   d
I expected the matrix column extraction to be faster, but it was slower. However, extracting a single value from a matrix (i.e. m[1, 1]) was faster than both of the ways of doing so with a data.frame. I'm lost as to why this is.
The above is only true for selecting columns. When selecting rows, matrices are much faster than data.frames. I still don't know why.
microbenchmark(
df[1, ],
m[1, ],
df[ , 1],
m[ , 1])
# Results; Unit: nanoseconds
#     expr   min      lq     mean  median      uq   max neval cld
#  df[1, ] 16359 17243.5 18766.93 17860.5 19849.5 42973   100   c
#   m[1, ]   718   999.5  1175.95  1181.0  1327.0  3595   100   a
# df[ , 1]  7664  8687.5  9888.57  9301.0 10535.5 42312   100   b
#  m[ , 1] 64874 66218.5 72074.93 73717.5 74084.5 97827   100   d
Consider the builtin data frame BOD. Data frames are stored as a list of columns, and the inspect output below shows the address of each of the two columns of BOD. We then assign its second column to BOD2. Note that the address of BOD2 is the same memory location as the second column shown in the inspect output for BOD. That is, all R did to create BOD2 was point it at memory within BOD. There was no data movement at all. Another way to see this is to compare the sizes of BOD, BOD2 and both together: both together take up the same amount of memory as BOD alone, so there cannot have been any copying. (Continued after code.)
library(pryr)
BOD2 <- BOD[[2]]
inspect(BOD)
## <VECSXP 0x507c278>
## <REALSXP 0x4f81f48>
## <REALSXP 0x4f81ed8> <--- compare this address to address shown below
## ...snip...
BOD2 <- BOD[,2]
address(BOD2)
## [1] "0x4f81ed8"
object_size(BOD)
## 1.18 kB
object_size(BOD2)
## 96 B
object_size(BOD, BOD2) # same as object_size(BOD) above
## 1.18 kB
Matrices are stored as one long vector with dimensions rather than as a list of columns, so the strategy for extracting a column is different. If we look at the memory used by a matrix m, an extracted column m2, and both together, we see below that both together use the sum of the memory of the individual objects, showing that the data was copied.
set.seed(123)
n <- 10000L
m <- matrix(rnorm(2*n), n, 2)
m2 <- m[, 2]
object_size(m)
## 160 kB
object_size(m2)
## 80 kB
object_size(m, m2)
## 240 kB <-- unlike for data.frames this equals sum of above
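One consequence of the copying shown above is that the cost of pulling a column out of a matrix should grow with the number of rows, while pulling a column out of a data frame should stay roughly flat. The following is a rough sketch for checking this on your own machine. The helper time_extract, the sizes and the repetition count are hypothetical choices made for this illustration, and the timings will vary between systems.

```r
# Sketch: time repeated column extraction as the number of rows grows.
# `time_extract`, `sizes` and `reps` are hypothetical names for this example.
time_extract <- function(n, reps = 200L) {
  m  <- matrix(rnorm(2 * n), n, 2)
  df <- as.data.frame(m)
  c(n      = n,
    # m[, 2] allocates a new vector of length n on every call
    matrix = system.time(for (i in seq_len(reps)) m[, 2])[["elapsed"]],
    # df[[2]] returns the existing column vector
    df     = system.time(for (i in seq_len(reps)) df[[2]])[["elapsed"]])
}

sizes   <- c(1e3, 1e5, 1e6)
timings <- t(sapply(sizes, time_extract))  # one row per size
print(timings)
```

system.time is coarse, so for small n both columns may print as zero; microbenchmark, as used in the question, gives finer resolution.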
If your program relies on column extraction only up to a certain point, you could use a data frame for that portion, then do a one-time conversion to a matrix and work with the matrix for the rest.
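A minimal sketch of that two-phase approach follows. The object names (sim_df, sim_m, n_below), the column names, the loop count and the threshold are all made up for illustration; the original simulation is not shown in the question, so this only mirrors its described shape.

```r
set.seed(1)
n <- 1000L

# Phase 1: keep the data in a data frame while columns are read repeatedly.
sim_df  <- data.frame(a = runif(n, 0, 20), b = runif(n, 0, 20))
n_below <- 0L
for (step in 1:50) {
  col     <- sim_df$a                  # returns the stored column vector
  n_below <- n_below + sum(col < 10)   # check each value against a condition
}

# One-time conversion once the extraction-heavy phase is over.
sim_m <- as.matrix(sim_df)

# Phase 2: arithmetic updates on the matrix.
sim_m[, "a"] <- sim_m[, "a"] + 1
```

Whether this pays off depends on how the two kinds of operation are interleaved. If extraction and updates alternate in every iteration, there is no clean point at which to convert.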
I suppose it has to do with how R lays out these data structures in memory. A matrix in R is a 2-d array, which is stored the same way as a 1-d array. A variable points directly to that memory, so extracting a single value is very fast. Extracting a column from a matrix takes some computation: R has to allocate new memory and copy the values into it. A data frame, on the other hand, is actually a list of columns, so returning a column is faster. That's my guess; I hope someone can confirm it.
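The storage layout described above can be checked directly in base R. A matrix is a single vector filled column by column, so a column is a contiguous run of that vector and a row is a strided selection from it. The small matrix and the index names below are chosen only for illustration.

```r
n <- 5L
m <- matrix(as.numeric(1:(3 * n)), nrow = n, ncol = 3)

# Element (i, j) sits at linear position (j - 1) * n + i.
i <- 4L
j <- 2L
identical(m[i, j], m[(j - 1L) * n + i])   # TRUE

# Column j is the contiguous block of n elements starting at (j - 1) * n + 1.
col_idx <- ((j - 1L) * n + 1L):(j * n)
identical(m[, j], m[col_idx])             # TRUE

# Row i is every n-th element starting at position i.
row_idx <- seq(i, by = n, length.out = ncol(m))
identical(m[i, ], m[row_idx])             # TRUE
```

In each case the result is a new vector built from values in the matrix's underlying storage, which is consistent with the copying that the object_size comparison in the other answer shows.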