
Faster alternative to for loop in R which calls a function with another loop

I am trying to parse a huge dataset into R (1.3 Gb). The original data is a list of four million character strings, each one an observation of 137 variables.

First I've created a function that separates each character string according to the key provided in the dataset, where "d" is one of those strings. For the purpose of this question imagine that d has this form:

"2005400d"

and the key would be

varName <- c("YEAR","AGE","GENDER","STATUS")
varIn   <- c(1,5,7,8)
varEND  <- c(4,6,7,8)

where varIn and varEND track the splitting points. The function created was:

parseLine <- function(d) {
  k <- unlist(strsplit(d, ""))    # split the record into single characters
  vec <- rep(NA, length(varName))
  for (i in 1:length(varName)) {
    # reassemble the characters between the i-th start and end positions
    vec[i] <- paste(k[varIn[i]:varEND[i]], sep = "", collapse = "")
  }
  return(vec)
}
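As a side note (an editorial sketch, not part of the original question): base R's substring() is vectorised over its start and end arguments, so the inner loop can be replaced by a single call. Using the sample record and key above:

```r
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEND  <- c(4, 6, 7, 8)

# substring() recycles the input string against the start/end vectors,
# extracting all four fields of one record in a single vectorised call
parseLineVec <- function(d) substring(d, varIn, varEND)

parseLineVec("2005400d")
# "2005" "40" "0" "d"
```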

And then, in order to loop over all the data available, I've created a for loop.

df<-data.frame(matrix(ncol=length(varName)))
names(df)<-as.character(varName)

for (i in 1:length(data)){
  df<-rbind(df,parseLine(data[i]))
}

However, when I check the function with 1,000 iterations I get a system time of 10.82 seconds, but when I increase that to 10,000, instead of a time of 108.2 seconds I get 614.77 seconds, which indicates that as the number of iterations increases, the time needed grows much faster than linearly.
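(Editorial note: this superlinear growth is typical of calling rbind on a growing data frame inside a loop — each iteration copies everything accumulated so far, so total work is roughly quadratic in the number of records. A common fix, sketched here with hypothetical records and a vectorised substring() in place of the original inner loop, is to build all rows first and bind once:)

```r
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEND  <- c(4, 6, 7, 8)
data    <- rep("2005400d", 1000)    # hypothetical records

parseLine <- function(d) substring(d, varIn, varEND)

# parse every record into a list of character vectors, then bind once
rows <- lapply(data, parseLine)
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- varName
```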

Any suggestion for speeding up the process? I've tried to use the foreach library, but it didn't run in parallel as I expected.

m <- foreach(i = 1:10, .combine = rbind) %dopar% parseLine(data[i])
df <- m
names(df) <- as.character(varName)
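(Editorial note: %dopar% only runs in parallel after a parallel backend has been registered; without one, foreach emits a warning and falls back to sequential execution. A sketch, assuming the doParallel package is installed — the package and cluster size here are assumptions, not from the original post:)

```r
library(foreach)
library(doParallel)

data <- rep("2005400d", 100)    # hypothetical records
# self-contained splitter so workers need no extra exported variables
parseLine <- function(d) substring(d, c(1, 5, 7, 8), c(4, 6, 7, 8))

cl <- makeCluster(2)       # hypothetical: two worker processes
registerDoParallel(cl)     # without this, %dopar% runs sequentially

m <- foreach(i = 1:length(data), .combine = rbind) %dopar% parseLine(data[i])
stopCluster(cl)
```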

Why re-invent the wheel? Use read.fwf in the utils package (attached by default):

> dat <- "2005400d"
> varName <- c("YEAR","AGE","GENDER","STATUS")
> varIn   <- c(1,5,7,8)
> varEND  <- c(4,6,7,8)
> read.fwf(textConnection(dat), col.names=varName, widths=1+varEND-varIn)
  YEAR AGE GENDER STATUS
1 2005  40      0      d

You should get further efficiency if you specify colClasses, but my attempt to demonstrate this failed to show a difference. Perhaps that advice only applies to read.table and its cousins.
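(For reference: colClasses is passed through to read.table, so it can be supplied to read.fwf directly. A sketch with a second, made-up record:)

```r
dat <- c("2005400d", "1998251m")    # second record is made up for illustration
df <- read.fwf(textConnection(dat),
               widths     = c(4, 2, 1, 1),
               col.names  = c("YEAR", "AGE", "GENDER", "STATUS"),
               colClasses = c("integer", "integer", "integer", "character"))
df
```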
