R: loop over columns in data.table
I want to determine the column classes of a large data.table.
colClasses <- sapply(DT, FUN=function(x)class(x)[1])
This works, but apparently local copies are stored in memory:
> memory.size()
[1] 687.59
> colClasses <- sapply(DT, class)
> memory.size()
[1] 1346.21
A loop does not seem possible, because subsetting a data.table with "with=FALSE" always returns a data.table.
A quick and very dirty method is:
DT1 <- DT[1, ]
colClasses <- sapply(DT1, FUN=function(x)class(x)[1])
What is the most elegant and efficient way to do this?
Have briefly investigated, and it looks like a data.table bug.
> DT = data.table(a=1:1e6,b=1:1e6,c=1:1e6,d=1:1e6)
> Rprofmem()
> sapply(DT,class)
a b c d
"integer" "integer" "integer" "integer"
> Rprofmem(NULL)
> noquote(readLines("Rprofmem.out"))
[1] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
[2] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
[3] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
[4] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
> tracemem(DT)
> sapply(DT,class)
tracemem[000000000431A290 -> 00000000065D70D8]: as.list.data.table as.list lapply sapply
a b c d
"integer" "integer" "integer" "integer"
So, looking at as.list.data.table:
> data.table:::as.list.data.table
function (x, ...)
{
ans <- unclass(x)
setattr(ans, "row.names", NULL)
setattr(ans, "sorted", NULL)
setattr(ans, ".internal.selfref", NULL)
ans
}
<environment: namespace:data.table>
>
Note the pesky unclass on the first line. ?unclass confirms that it takes a deep copy of its argument. From this quick look it doesn't seem like sapply or lapply are doing the copying (I didn't think they did since R is good at copy-on-write, and those aren't writing), but rather the as.list in lapply (which dispatches to as.list.data.table).
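One way to check a claim like this on your own machine is tracemem, as used above. A minimal sketch on a plain classed list follows; the object names and the class name "myclass" are made up for illustration, and what tracemem reports may differ between R versions, so no output is shown.

```r
# Illustrative object only; "myclass" is a made-up class name.
x <- list(a = 1:10, b = letters)
class(x) <- "myclass"

# tracemem needs an R build with memory profiling, so check first.
traced <- capabilities("profmem")
if (traced) tracemem(x)        # report any duplication of x from here on

z <- unclass(x)                # any duplication made here shows up as a tracemem line

if (traced) untracemem(x)

# unclass drops the class from its result but leaves x itself untouched.
is.null(attr(z, "class"))
inherits(x, "myclass")
```

If a tracemem line appears when unclass runs, the object was duplicated at that point.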
So, if we avoid the unclass, it should speed up. Let's try:
> DT = data.table(a=1:1e7,b=1:1e7,c=1:1e7,d=1:1e7)
> system.time(sapply(DT,class))
user system elapsed
0.28 0.06 0.35
> system.time(sapply(DT,class)) # repeat timing a few times and take minimum
user system elapsed
0.17 0.00 0.17
> system.time(sapply(DT,class))
user system elapsed
0.13 0.04 0.18
> system.time(sapply(DT,class))
user system elapsed
0.14 0.03 0.17
> assignInNamespace("as.list.data.table",function(x)x,"data.table")
> data.table:::as.list.data.table
function(x)x
> system.time(sapply(DT,class))
user system elapsed
0 0 0
> system.time(sapply(DT,class))
user system elapsed
0.01 0.00 0.02
> system.time(sapply(DT,class))
user system elapsed
0 0 0
> sapply(DT,class)
a b c d
"integer" "integer" "integer" "integer"
>
So, yes, infinitely better.
I've raised bug report #2000 to remove the as.list.data.table method, since a data.table is() already a list, too. This might speed up quite a few idioms actually, such as lapply(.SD,...). [EDIT: This was fixed in v1.8.1].
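To make the "already a list" point concrete, here is a small sketch on a made-up table (the column names g, a, b are illustrative only). It shows that list tools apply to a data.table directly, and what the lapply(.SD, ...) idiom looks like with a grouping column.

```r
library(data.table)

# Small made-up table; column names are illustrative only.
DT <- data.table(g = c("x", "x", "y"), a = 1:3, b = c(2, 4, 6))

# A data.table is a list of columns, so list tools apply to it directly.
is.list(DT)
col_classes <- sapply(DT, function(x) class(x)[1L])

# The lapply(.SD, ...) idiom: apply a function to every
# non-grouping column within each group.
means_by_group <- DT[, lapply(.SD, mean), by = g]
```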
Thanks for asking this question!!
I don't see anything wrong in an approach like this:
colClasses <- sapply(head(DT,1), FUN=class)
It is basically your quick'n'dirty solution but perhaps a bit clearer (even if not so much)...
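On the point that a loop seems impossible: `[[` pulls out a single column as a plain vector rather than a data.table, so an explicit loop can be written with it. A minimal sketch on a made-up table follows; whether it beats sapply on a large table has not been measured here.

```r
library(data.table)

# Small made-up table; column names are illustrative only.
DT <- data.table(a = 1:3, b = c("x", "y", "z"), c = c(0.5, 1, 1.5))

# Pre-allocate a named result, one slot per column.
col_classes <- character(ncol(DT))
names(col_classes) <- names(DT)

# DT[[j]] returns column j as a vector, so class() sees the column itself.
for (j in seq_along(DT)) {
  col_classes[j] <- class(DT[[j]])[1L]
}
```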