
R: running out of memory using `strsplit`

I am running out of memory using strsplit (presumably); here is the code:

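# For each field, keep the part before `split` in place and move the
# original value to a new column named <field><suffix>.
split.fields <- function (frame, fields, split, suffix, ...) {
  for (field in fields) {
    # use the `split` argument rather than a hard-coded "@"
    v <- sapply(strsplit(frame[[field]], split, ...), "[", 1)
    frame[[paste0(field, suffix)]] <- frame[[field]]
    frame[[field]] <- v
  }
  frame
}
# "Chrome@28" becomes browser = "Chrome", browserVer = "Chrome@28"
split.version <- function (frame, fields)
  split.fields(frame, fields, split="@", suffix="Ver", fixed=TRUE)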
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 238165 12.8     467875   25   407500 21.8
Vcells 369492  2.9     905753    7   905631  7.0
> frame <- data.frame(browser = sample(c("Chrome@28","Chrome@27","Firefox@21","Firefox@22","IE@9","IE@8"), 1e7, replace=TRUE), stringsAsFactors=FALSE)
> str(frame)
'data.frame':   10000000 obs. of  1 variable:
 $ browser: chr  "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> object.size(frame)
80000992 bytes
> gc()
           used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   240555 12.9     467875  25.0   407500  21.8
Vcells 10373979 79.2   34109873 260.3 40534688 309.3
> system.time(frame <- split.version(frame,"browser"))
   user  system elapsed 
 73.700   0.872  74.831 
> object.size(frame)
160001248 bytes
> str(frame)
'data.frame':   10000000 obs. of  2 variables:
 $ browser   : chr  "IE" "Chrome" "Chrome" "Chrome" ...
 $ browserVer: chr  "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> gc()
           used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells   264888  14.2   16652260 889.4  31376740 1675.7
Vcells 20459856 156.1   95461025 728.4 119226749  909.7

This all looks more or less reasonable, except that the R process's RSS is now 1.6G.

This appears to imply that the 1675.7Mb of Ncells in the max used column have not been returned to the OS.

I don't care much about the OS not getting the RAM back; what I do care about is that R allocated 1.6G to process 80M of data (and on my real data it runs out of the available physical RAM).

Is there a way to make this more memory efficient?

E.g., maybe converting the character vector to a factor and operating on its levels would help?

R version 3.0.1 (2013-05-16) -- "Good Sport"
Platform: x86_64-pc-linux-gnu (64-bit)
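
The factor idea raised in the question could look roughly like the sketch below. The helper name split.fields.factor and its default arguments are hypothetical, not part of the original code. It assumes each column has only a handful of distinct values, so strsplit runs on the levels rather than on every row.

```r
# Hypothetical variant of split.fields: split only the distinct values
# (the factor levels), then map the results back onto the rows.
split.fields.factor <- function(frame, fields, split = "@", suffix = "Ver") {
  for (field in fields) {
    f <- factor(frame[[field]])
    # keep the original values under the suffixed name
    frame[[paste0(field, suffix)]] <- frame[[field]]
    # strsplit sees one string per level, not one per row
    stems <- vapply(strsplit(levels(f), split, fixed = TRUE),
                    `[`, character(1), 1L, USE.NAMES = FALSE)
    # integer codes index into the per-level results; NA stays NA
    frame[[field]] <- stems[as.integer(f)]
  }
  frame
}

df <- data.frame(browser = c("IE@8", "Chrome@27", "Chrome@27", NA),
                 stringsAsFactors = FALSE)
out <- split.fields.factor(df, "browser")
```

Building the factor still scans the whole column once, so this is only likely to pay off when the number of distinct values is small relative to the number of rows.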

How about using substr and regexpr:

x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
substr(x,1,regexpr("@",x)-1)
[1] "Chrome"  "Chrome"  "Firefox" "IE" 
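
One caveat with the substr/regexpr approach: regexpr returns -1 for strings that contain no "@", and substr then yields an empty string. If such values can occur, a guarded version might look like the sketch below. The function name strip.version is hypothetical, and fixed = TRUE is an assumption that the separator is a literal string rather than a regular expression.

```r
# Hypothetical helper: return the part before the first separator,
# leaving strings without a separator unchanged.
strip.version <- function(x, split = "@") {
  pos <- regexpr(split, x, fixed = TRUE)  # -1 where there is no match
  hit <- which(pos > 0)                   # which() also drops NA positions
  stem <- x
  stem[hit] <- substr(x[hit], 1, pos[hit] - 1)
  stem
}

stem <- strip.version(c("Chrome@28", "Firefox@21", "Opera"))
```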

What @James said, or even simpler:

x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
sub('@.*', '', x)
#[1] "Chrome"  "Chrome"  "Firefox" "IE"  
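
To plug this into the question's split.fields, something along these lines might work. The name split.fields.sub and the default arguments are hypothetical. The sketch assumes the separator contains no regex metacharacters, because sub treats its pattern as a regular expression.

```r
# Hypothetical drop-in for split.fields that uses sub instead of strsplit.
split.fields.sub <- function(frame, fields, split = "@", suffix = "Ver") {
  # everything from the first separator to the end of the string
  pattern <- paste0(split, ".*")
  for (field in fields) {
    # keep the original values under the suffixed name
    frame[[paste0(field, suffix)]] <- frame[[field]]
    # drop the separator and everything after it
    frame[[field]] <- sub(pattern, "", frame[[field]])
  }
  frame
}

df <- data.frame(browser = c("IE@8", "Chrome@27", "Firefox@21"),
                 stringsAsFactors = FALSE)
out <- split.fields.sub(df, "browser")
```

Since sub returns a single character vector, there is no intermediate list with one element per row, which is presumably where the strsplit version spends most of its memory.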
