Understanding the reference properties of data.table in R
To clear things up for myself, I would like to better understand when copies are made and when they are not in data.table. As the question "Understanding exactly when a data.table is a reference to (vs a copy of) another data.table" points out, if one simply runs the following, one ends up modifying the original:
library(data.table)
DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
# a b
# [1,] 1 11
# [2,] 2 12
newDT <- DT # reference, not copy
newDT[1, a := 100] # modify new DT
print(DT) # DT is modified too.
# a b
# [1,] 100 11
# [2,] 2 12
However, if one does the following (for example), only the new table is modified and the original is left untouched:
DT = data.table(a=1:10)
DT
a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
newDT = DT[a<11]
newDT
a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
newDT[1:5,a:=0L]
newDT
a
1: 0
2: 0
3: 0
4: 0
5: 0
6: 6
7: 7
8: 8
9: 9
10: 10
DT
a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
As I understand it, this happens because when you execute an i statement, data.table returns a whole new table rather than a reference to the memory occupied by the selected elements of the old data.table. Is this correct?
EDIT: sorry, I meant i, not j (changed this above).
When you create newDT in the second example, you are evaluating i (not j). := assigns by reference within the j argument. There is no equivalent in the i statement, because the self-reference over-allocates the columns, but not the rows.
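A minimal sketch of that difference, using data.table's address() to compare objects. The table contents and the variable names (addr_before, same_object, new_object) are made up for illustration, and the sketch assumes the default over-allocation is in effect:

```r
library(data.table)

DT <- data.table(a = 1:10)
addr_before <- address(DT)

# := in j adds the column in place, so DT is still the same object
DT[, b := a * 2L]
same_object <- identical(address(DT), addr_before)

# a subset in i builds a new table, so updating it leaves DT alone
newDT <- DT[a < 11]
newDT[1:5, a := 0L]
new_object <- !identical(address(newDT), address(DT))
```

Here same_object and new_object should both be TRUE, and DT$a should still hold 1 to 10.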
A data.table is a list. Its length equals the number of columns, but it is over-allocated so that you can add more columns without copying the entire table (e.g. using := in j).
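As a rough illustration, length() and truelength() can be compared directly. The exact number of spare slots depends on the data.table version and options, so the sketch below only assumes it is larger than the number of columns; n_cols and n_slots are illustrative names:

```r
library(data.table)

DT <- data.table(a = 1:10)

n_cols  <- length(DT)      # number of columns actually in use
n_slots <- truelength(DT)  # number of column pointer slots allocated

# adding a column by reference uses one of the spare slots
DT[, b := 11:20]
n_cols_after <- length(DT)
```

Because the spare slots already exist, adding b does not require copying the existing column a.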
If we inspect the data.table, we can see the truelength (tl = 100), which is the number of column pointer slots:
.Internal(inspect(DT))
@1427d6c8 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=1, tl=100)
@b249a30 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...
Within the data.table, each element has length 10 and tl=0. Currently there is no method to increase the truelength of the columns to allow appending extra rows by reference.
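In practice this means rows are appended by building a new table, for example with rbind(). The sketch below is only an illustration with made-up data, not a statement about how rbind() is implemented internally:

```r
library(data.table)

DT <- data.table(a = 1:10)

# appending a row yields a new, longer table; DT keeps its 10 rows
bigger <- rbind(DT, data.table(a = 11L))

rows_before <- nrow(DT)
rows_after  <- nrow(bigger)
different_object <- !identical(address(bigger), address(DT))
```

To keep the longer table under the original name, the result has to be assigned back (DT <- rbind(DT, ...)).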
From ?truelength:
Currently, it's just the list vector of column pointers that is over-allocated (ie truelength(DT)), not the column vectors themselves, which would in future allow fast row insert()
When you evaluate i, data.table doesn't check whether you have simply returned all rows in the same order as in the original (and skip the copy only in that case); it simply returns a copy.
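A small sketch of this behaviour, assuming the same ten-row table as in the question. It compares the memory address of the column before and after an i subset that keeps every row, and also shows copy(), data.table's function for requesting a copy explicitly. The names all_rows, same_values, same_memory and explicit are illustrative:

```r
library(data.table)

DT <- data.table(a = 1:10)

# every row satisfies the condition, yet the result is still a new table
all_rows <- DT[a < 11]
same_values <- identical(all_rows$a, DT$a)
same_memory <- identical(address(all_rows$a), address(DT$a))

# copy() makes the intent explicit when an independent table is wanted
explicit <- copy(DT)
explicit[1, a := 100L]
original_first <- DT$a[1]
```

same_values should be TRUE while same_memory is FALSE, and original_first should still be 1 after the update to explicit.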