
faster alternative to object.size?

Is there a faster way to determine the size of an object than object.size (or a way to make it execute more quickly)?

start.time <- Sys.time()
object.size(DB.raw)
#  5361302280 bytes
Sys.time() - start.time
#  Time difference of 1.644485 mins  <~~~  A minute and a half simply to show the size

print.dims(DB.raw)
#  43,581,894 rows X 15 cols 

I'm also wondering why it takes so long to compute the object size. Presumably, for each column, it has to traverse every row to find that column's total size?

On a Windows box, you might be able to get a pretty close estimate using gc() and memory.size() before and after creating DB.raw.

gc()
x <- memory.size()
# DB.raw created, extra variables rm'ed
gc()
memory.size() - x # estimate of DB.raw size in Mb
# object.size(DB.raw) / 1048576 # for comparison

The most likely reason it takes so long is that you have character columns. It seems object.size needs to examine each string in order to determine its size, although I'm not sure.

x<-rep(paste0(letters[1:3],collapse=""),1e8)
system.time(object.size(x))
##  user  system elapsed 
## 1.608   0.592   2.202 
x<-rep(0.5,1e9)
system.time(object.size(x))
## user  system elapsed 
## 0.000   0.000   0.001 

We can see longer strings taking up more space (at least in some cases) like this:

> x<-replicate(1e5,paste0(letters[sample(26,3)],collapse=""))
> x1<-replicate(1e5,paste0(letters[sample(26,2)],collapse=""))
> object.size(x)
1547544 bytes
> object.size(x1)
831240 bytes

I can't think of any way around this if you need an exact size. However, you can get a very accurate estimate by sampling a large number of elements, calling object.size() on the sample to estimate the size per element, and then multiplying by the total number of elements (or rows) you have.

For example:

estObjectSize<-function(x,n=1e5){
  length(x)*object.size(sample(x,n))/n
}
x0<-sapply(1:20,function(x) paste0(letters[1:x],collapse=""))
x<-x0[sample(20,1e8,T)]

> system.time(size<-object.size(x))
   user  system elapsed 
  1.632   0.856   2.495 
> system.time(estSize<-estObjectSize(x))
   user  system elapsed 
  0.012   0.000   0.013 
> size
800001184 bytes
> estSize
801184000 bytes

You have to tweak the code a bit to get it to work for a data frame, but this is the idea.
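For instance, a rough sketch of a data-frame variant might look like the following (the name estDfSize and the row-subsetting approach are my own, not from the answer above):

estDfSize <- function(df, n = 1e5) {
  # estimate total size by measuring a sample of rows and scaling up
  n <- min(n, nrow(df))
  idx <- sample(nrow(df), n)                    # n row indices, without replacement
  nrow(df) * object.size(df[idx, , drop = FALSE]) / n
  # note: heavy duplication in character columns can make this over-estimate
  # relative to object.size() on the full data frame
}

# estDfSize(DB.raw)   # compare against object.size(DB.raw)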

To add: it looks like the number of bytes per character needed to store an array of strings depends on a few things, including string interning and excess buffer memory allocated during string construction. It's certainly not as simple as multiplying string length by the number of strings, and it's not surprising that computing the size takes longer:

> bytesPerString<-sapply(1:20,
+   function(x)
+       object.size(replicate(1e5,paste0(letters[sample(26,x)],collapse="")))/1e5)
> bytesPerString
 [1]  8.01288  8.31240 15.47928 49.87848 55.71144 55.98552 55.99848 64.00040
 [9] 64.00040 64.00040 64.00040 64.00040 64.00040 64.00040 64.00040 80.00040
[17] 80.00040 80.00040 80.00040 80.00040
> bytesPerChar<-(bytesPerString-8)/(1:20+1)
> bytesPerChar
 [1] 0.0064400 0.1041333 1.8698200 8.3756960 7.9519067 6.8550743 5.9998100
 [8] 6.2222667 5.6000400 5.0909455 4.6667000 4.3077231 4.0000286 3.7333600
[15] 3.5000250 4.2353176 4.0000222 3.7894947 3.6000200 3.4285905
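As a rough illustration of the interning point (my own example, not from the runs above): a vector built from many copies of a single string should report a much smaller size than a vector of the same length holding mostly distinct strings, since duplicated strings appear to be stored (and counted) only once; exact numbers will vary by R version and platform:

x_dup  <- rep("abc", 1e5)                                             # one distinct string, 1e5 pointers to it
x_uniq <- replicate(1e5, paste0(letters[sample(26, 3)], collapse="")) # mostly distinct strings
object.size(x_dup)   # roughly the size of the pointer array
object.size(x_uniq)  # substantially larger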
