faster alternative to object.size?

Does there exist a faster way to identify an object's size than object.size (or a method to have it execute more quickly)?

start.time <- Sys.time()
object.size(DB.raw)
#  5361302280 bytes
Sys.time() - start.time
#  Time difference of 1.644485 mins   <~~~ a minute and a half
#                                          simply to show the size

print.dims(DB.raw)
#  43,581,894 rows X 15 cols 

I'm also wondering why it takes so long to compute the object size. Presumably, for each column it has to traverse every row to find that column's total size?

On a Windows box, you might be able to get a pretty close estimate using gc() and memory.size() before and after creating DB.raw. (Note that memory.size() is Windows-only, and as of R 4.2.0 it is a stub that simply returns Inf.)

gc()
x <- memory.size()
# DB.raw created, extra variables rm'ed
gc()
memory.size() - x # estimate of DB.raw size in Mb
# object.size(DB.raw) / 1048576 # for comparison

The most likely reason it is taking so long is that you have character objects. This seems to be because it needs to count the characters in each string in order to determine the size, although I'm not sure.

x <- rep(paste0(letters[1:3], collapse = ""), 1e8)
system.time(object.size(x))
##   user  system elapsed 
##  1.608   0.592   2.202 
x <- rep(0.5, 1e9)
system.time(object.size(x))
##   user  system elapsed 
##  0.000   0.000   0.001 

We can see longer strings taking up more space (at least in some cases) like this:

> x<-replicate(1e5,paste0(letters[sample(26,3)],collapse=""))
> x1<-replicate(1e5,paste0(letters[sample(26,2)],collapse=""))
> object.size(x)
1547544 bytes
> object.size(x1)
831240 bytes

I can't think of any way around this if you need an exact size. However, you can get a highly accurate estimate of the size by sampling a large number of elements, calling object.size() on the sample to estimate the size per element, and then multiplying by the total number you have.

For example:

estObjectSize <- function(x, n = 1e5) {
  length(x) * object.size(sample(x, n)) / n
}
x0 <- sapply(1:20, function(x) paste0(letters[1:x], collapse = ""))
x <- x0[sample(20, 1e8, replace = TRUE)]

> system.time(size<-object.size(x))
   user  system elapsed 
  1.632   0.856   2.495 
> system.time(estSize<-estObjectSize(x))
   user  system elapsed 
  0.012   0.000   0.013 
> size
800001184 bytes
> estSize
801184000 bytes

You have to tweak the code a bit to get it to work for a data frame, but this is the idea.
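One possible tweak (a sketch only, not from the original answer; the helper name estDFSize is made up here) is to sample row indices, measure that subset with object.size(), and scale by the total row count:

```r
# Sketch: estimate a data frame's size by measuring a random subset of rows
# and scaling up. Shared strings in R's string pool and fixed per-column
# overhead mean this is a rough estimate, not an exact figure.
estDFSize <- function(df, n = 1e4) {
  n <- min(n, nrow(df))                # don't sample more rows than exist
  idx <- sample(nrow(df), n)           # random row indices, without replacement
  size_n <- as.numeric(object.size(df[idx, , drop = FALSE]))
  size_n * nrow(df) / n                # scale to the full row count, in bytes
}

df <- data.frame(a = runif(1e5),
                 b = sample(letters, 1e5, replace = TRUE))
estDFSize(df)        # fast, approximate size in bytes
object.size(df)      # exact size, for comparison
```

The estimate can drift for character columns, because strings shared across rows are counted once in the subset but then scaled up as if they repeated; for mostly numeric data frames it should land within a few percent of the exact value.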

To add: it looks like the number of bytes needed to store an array of strings depends on a few things, including string interning and excess buffer memory allocated during string construction. It's certainly not as simple as multiplying by the number of strings, so it is not surprising that it takes longer:

> bytesPerString<-sapply(1:20,
+   function(x)
+       object.size(replicate(1e5,paste0(letters[sample(26,x)],collapse="")))/1e5)
> bytesPerString
 [1]  8.01288  8.31240 15.47928 49.87848 55.71144 55.98552 55.99848 64.00040
 [9] 64.00040 64.00040 64.00040 64.00040 64.00040 64.00040 64.00040 80.00040
[17] 80.00040 80.00040 80.00040 80.00040
> bytesPerChar<-(bytesPerString-8)/(1:20+1)
> bytesPerChar
 [1] 0.0064400 0.1041333 1.8698200 8.3756960 7.9519067 6.8550743 5.9998100
 [8] 6.2222667 5.6000400 5.0909455 4.6667000 4.3077231 4.0000286 3.7333600
[15] 3.5000250 4.2353176 4.0000222 3.7894947 3.6000200 3.4285905
