Understanding the reference properties of data.table in R

To clear things up for myself, I would like to better understand when data.table makes copies and when it does not. As the question "Understanding exactly when a data.table is a reference to (vs a copy of) another data.table" points out, if you simply run the following, you end up modifying the original:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

newDT <- DT        # reference, not copy
newDT[1, a := 100] # modify new DT

print(DT)          # DT is modified too.
#        a  b
# [1,] 100 11
# [2,]   2 12
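
One way to check this is to compare memory addresses. The following is a minimal sketch, assuming data.table's exported address() and copy() helpers; the names sharedDT and ownDT are illustrative only and do not appear in the original question.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

sharedDT <- DT       # plain assignment: both names refer to the same object
ownDT <- copy(DT)    # copy() creates an independent table

same_object <- address(sharedDT) == address(DT)  # same memory address
own_object <- address(ownDT) != address(DT)      # different memory address

ownDT[1, a := 100]   # modifies only the copy
print(DT)            # DT still holds a = 1, 2
```

If an independent table is what you want, copy() makes that intent explicit instead of relying on a subset to produce one.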

However, if you do the following (for example), only the new table is modified and the original is left unchanged:

DT = data.table(a=1:10)
DT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

newDT = DT[a<11]
newDT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

newDT[1:5,a:=0L]

newDT
     a
 1:  0
 2:  0
 3:  0
 4:  0
 5:  0
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

DT
     a
 1:  1
 2:  2
 3:  3
 4:  4
 5:  5
 6:  6
 7:  7
 8:  8
 9:  9
10: 10

As I understand it, this happens because when you execute an i statement, data.table returns a whole new table rather than a reference to the memory occupied by the selected elements of the old data.table. Is this correct?

EDIT: Sorry, I meant i, not j (changed above).

When you create newDT in the second example, you are evaluating i (not j). := assigns by reference within the j argument. There is no equivalent for i, because a data.table over-allocates its column slots but not its rows.
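
The difference can be checked directly. This is a small sketch, assuming data.table's exported address() helper; same_values and different_object are illustrative names.

```r
library(data.table)

DT <- data.table(a = 1:10)
newDT <- DT[a < 11]    # i is evaluated; every row happens to match

same_values <- identical(DT$a, newDT$a)               # the contents are equal ...
different_object <- address(DT) != address(newDT)     # ... but it is a new object

newDT[1:5, a := 0L]    # := updates newDT by reference
print(DT$a[1:5])       # DT is untouched
```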

A data.table is a list. Its length equals the number of columns, but it is over-allocated so that you can add more columns without copying the entire table (e.g. using := in j).
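
The over-allocation can be observed from R. This is a sketch assuming data.table's exported truelength() and address() functions; the exact number of spare slots depends on the data.table version and options, so no specific value is assumed here.

```r
library(data.table)

DT <- data.table(a = 1:10)

print(length(DT))       # number of columns in use
print(truelength(DT))   # number of allocated column pointer slots

before <- address(DT)
DT[, b := a * 2L]       # add a column by reference, using a spare slot
after <- address(DT)

print(before == after)  # the table itself was not reallocated
```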

If we inspect the data.table, we can see its truelength (tl = 100), which is the number of column pointer slots:

 .Internal(inspect(DT))
@1427d6c8 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=1, tl=100)
  @b249a30 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...

Within the data.table, each element has length 10 and tl=0. Currently there is no method to increase the truelength of the columns to allow appending extra rows by reference.

From ?truelength:

Currently, it's just the list vector of column pointers that is over-allocated (i.e. truelength(DT)), not the column vectors themselves, which would in future allow fast row insert()

When you evaluate i, data.table does not check whether you have simply returned all rows in the same order as in the original (and skip the copy only in that case); it always returns a copy.
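
If the goal is to change a subset of rows in the original table, one option is to combine i and j in a single call, so that := operates on DT itself rather than on a subsetted copy. This is a sketch of that pattern, not something stated in the original answer.

```r
library(data.table)

DT <- data.table(a = 1:10)

# i selects the rows and j assigns, all in one call on DT itself,
# so the update is made by reference on the original table.
DT[a < 6, a := 0L]

print(DT)   # the first five rows of DT are now 0
```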
