R: Looping over the columns of a data.table

Question

I want to determine the column classes of a large data.table.

colClasses <- sapply(DT, FUN=function(x)class(x)[1])

works, but apparently local copies are held in memory:

> memory.size()
[1] 687.59
> colClasses <- sapply(DT, class)
> memory.size()
[1] 1346.21

A loop does not seem possible, because subsetting a data.table with "with=FALSE" always returns a data.table.

A quick and very dirty method is:

DT1 <- DT[1, ]
colClasses <- sapply(DT1, FUN=function(x)class(x)[1])

What is the most elegant and efficient way to do this?

Answer 1

I have briefly investigated, and it looks like a data.table bug.

> DT = data.table(a=1:1e6,b=1:1e6,c=1:1e6,d=1:1e6)
> Rprofmem()
> sapply(DT,class)
        a         b         c         d 
"integer" "integer" "integer" "integer" 
> Rprofmem(NULL)
> noquote(readLines("Rprofmem.out"))
[1] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"       
[2] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply" 
[3] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"   
[4] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply" 

> tracemem(DT)
> sapply(DT,class)
tracemem[000000000431A290 -> 00000000065D70D8]: as.list.data.table as.list lapply sapply 
        a         b         c         d 
"integer" "integer" "integer" "integer"

So, looking at as.list.data.table:

> data.table:::as.list.data.table
function (x, ...) 
{
    ans <- unclass(x)
    setattr(ans, "row.names", NULL)
    setattr(ans, "sorted", NULL)
    setattr(ans, ".internal.selfref", NULL)
    ans
}
<environment: namespace:data.table>
>

Note the pesky unclass on the first line. ?unclass confirms that it takes a deep copy of its argument. From this quick look, it doesn't seem that sapply or lapply is doing the copying (I didn't think they would, since R is good at copy-on-write and those functions aren't writing), but rather the as.list call inside lapply (which dispatches to as.list.data.table).
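As a minimal sketch of the dispatch described above, the following uses a made-up S3 class, toy, in place of data.table. The class, the as_list_toy function and the counter are hypothetical names for illustration only. The point is that lapply (and therefore sapply) calls as.list on any classed object, so whatever that method does, such as unclass, runs before the loop starts.

```r
# Hypothetical S3 class "toy" standing in for data.table (illustration only).
counter <- new.env()
counter$n <- 0L

as_list_toy <- function(x, ...) {
  counter$n <- counter$n + 1L   # record that the method was reached
  unclass(x)                    # same first step as as.list.data.table
}
# Register explicitly so dispatch works wherever this sketch is evaluated.
registerS3method("as.list", "toy", as_list_toy)

x <- structure(list(a = 1:3, b = letters[1:3]), class = "toy")

# sapply calls lapply, which calls as.list(x) because x has a class attribute.
res <- sapply(x, function(col) class(col)[1L])
print(res)
print(counter$n)  # non-zero: the as.list method ran inside lapply
```

With a real data.table of that era, the method reached in the same way was as.list.data.table, which is where the copy in the Rprofmem output came from.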

So, if we avoid the unclass, it should speed up. Let's try:

> DT = data.table(a=1:1e7,b=1:1e7,c=1:1e7,d=1:1e7)
> system.time(sapply(DT,class))
   user  system elapsed 
   0.28    0.06    0.35 
> system.time(sapply(DT,class))  # repeat timing a few times and take minimum
   user  system elapsed 
   0.17    0.00    0.17 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.13    0.04    0.18 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.14    0.03    0.17 
> assignInNamespace("as.list.data.table",function(x)x,"data.table")
> data.table:::as.list.data.table
function(x)x
> system.time(sapply(DT,class))
   user  system elapsed 
      0       0       0 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.01    0.00    0.02 
> system.time(sapply(DT,class))
   user  system elapsed 
      0       0       0 
> sapply(DT,class)
        a         b         c         d 
"integer" "integer" "integer" "integer" 
>

So, yes, infinitely better.

I've raised bug report #2000 to remove the as.list.data.table method, since a data.table is() already a list, too. This might actually speed up quite a few idioms, such as lapply(.SD,...). [EDIT: This was fixed in v1.8.1.]
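Because a data.table is already a list of column vectors, the classes can also be collected by iterating over the columns directly, without any with=FALSE subsetting. A minimal sketch, assuming the toy table below (its column names are made up) and falling back to a base data.frame, which is likewise a list of columns, when data.table is not installed:

```r
# Use data.table when available; otherwise a base data.frame (also a list of columns).
make_table <- if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::data.table
} else {
  data.frame
}

DT <- make_table(
  id    = 1:5,
  label = letters[1:5],
  value = c(1.5, 2, 3, 4, 5),
  when  = as.POSIXct("2012-05-14", tz = "UTC") + 0:4,  # two classes: POSIXct, POSIXt
  stringsAsFactors = FALSE
)

stopifnot(is.list(DT))  # columns can be iterated directly

# vapply guarantees one character string per column; [1L] keeps the first class only.
col_classes <- vapply(DT, function(x) class(x)[1L], character(1))

# Equivalent explicit loop: [[ extracts a single column as a plain vector.
loop_classes <- character(length(DT))
for (j in seq_along(DT)) loop_classes[j] <- class(DT[[j]])[1L]
names(loop_classes) <- names(DT)

print(col_classes)
```

Using vapply rather than sapply here is a design choice, not something the answer prescribes: it fails loudly if a column ever yields something other than a single string.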

Thanks for asking this question!!

Answer 2

I don't see anything wrong with an approach like this:

colClasses <- sapply(head(DT, 1), FUN=class)

It is basically your quick'n'dirty solution, but perhaps a bit clearer (even if not by much)...
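A small check of why this works: taking the first row does not change any column's type, so the classes read from a one-row table match those of the full table, while only a single row is handed to sapply. This is a sketch only; the table and its column names are made up, and a base data.frame is used as a fallback if data.table is not installed.

```r
# Use data.table when available; otherwise a base data.frame behaves the same here.
make_table <- if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::data.table
} else {
  data.frame
}

DT <- make_table(
  a = 1:1000,
  b = rnorm(1000),
  c = rep(c("x", "y"), 500),
  stringsAsFactors = FALSE
)

first_class <- function(x) class(x)[1L]

full_classes    <- sapply(DT, first_class)           # iterates over the full table
one_row_classes <- sapply(head(DT, 1), first_class)  # iterates over a one-row table

print(one_row_classes)
```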

Answer 1: score 10, accepted, 2012-05-14 18:07:33
Answer 2: score 2, 2012-05-14 14:55:01
