
Faster alternative to for loop in R which calls a function with another loop

I am trying to parse a huge dataset (1.3 GB) into R. The original data is a list of four million character strings, each of which is one observation of 137 variables.

First, I created a function that splits a string according to the key provided with the dataset, where "d" is one of the strings. For the purpose of this question, imagine that d has this form:

"2005400d"

and the key would be

varName <- c("YEAR","AGE","GENDER","STATUS")
varIn   <- c(1,5,7,8)
varEnd  <- c(4,6,7,8)

where varIn and varEnd give the start and end position of each field. The function I wrote is:

parseLine<-function(d){
  # split the record into single characters
  k<-unlist(strsplit(d,""))
  vec<-character(length(varName))
  for (i in seq_along(varName)){
    # glue the characters of field i back together
    vec[i]<-paste(k[varIn[i]:varEnd[i]],collapse="")
  }
  return(vec)
}
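As a possible shortcut for the same key, base R's substring() is vectorized over its start and end positions, so a single call can return every field of a line without first splitting it into single characters. This is only a sketch; parseLineSub is a hypothetical name used for illustration and is not part of the original post.

```r
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEnd  <- c(4, 6, 7, 8)

# parseLineSub is a hypothetical helper, not the original parseLine.
# substring() recycles d over the start/end vectors and returns one field per pair.
parseLineSub <- function(d) {
  substring(d, varIn, varEnd)
}

fields <- parseLineSub("2005400d")
names(fields) <- varName
print(fields)
```

The result is a named character vector with one element per variable in the key.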

And then in order to loop over all the data available, I've created a for loop.

df<-data.frame(matrix(ncol=length(varName)))
names(df)<-as.character(varName)

for (i in 1:length(data)){
  df<-rbind(df,parseLine(data[i]))
}

However, when I time the loop with 1,000 iterations I get a system time of 10.82 seconds, but when I increase that to 10,000 iterations I get 614.77 seconds instead of the 108.2 seconds I would expect. This indicates that the time needed grows much faster than linearly with the number of iterations.

Any suggestions for speeding up the process? I've tried the foreach library, but it did not run in parallel as I expected.
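One likely cause of this slowdown, offered as an assumption rather than a measured diagnosis, is that each rbind call copies the whole data frame built so far, so the total work grows roughly with the square of the number of rows. A minimal sketch that avoids the repeated copying parses every record first and builds the data frame once. The three-element data vector below is made-up sample input.

```r
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEnd  <- c(4, 6, 7, 8)

# Made-up sample records standing in for the real data.
data <- c("2005400d", "2006311m", "2007250s")

# vapply() fills a character matrix with one column per record,
# so no object is grown inside a loop.
parsed <- vapply(data, function(d) substring(d, varIn, varEnd),
                 character(length(varName)), USE.NAMES = FALSE)

# Transpose to one row per record and build the data frame in a single step.
df <- as.data.frame(t(parsed), stringsAsFactors = FALSE)
names(df) <- varName
print(df)
```

All columns come back as character here; converting YEAR or AGE to numeric would be a separate step.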

m<-foreach(i=1:10,.combine=rbind) %dopar% parseLine(data[i])
# rbind returns a character matrix, so convert it before naming the columns
df<-as.data.frame(m)
names(df)<-as.character(varName)

Why re-invent the wheel? Use read.fwf in the utils package (attached by default):

> dat <- "2005400d"
> varName <- c("YEAR","AGE","GENDER","STATUS")
> varIn   <- c(1,5,7,8)
> varEND  <- c(4,6,7,8)
> read.fwf(textConnection(dat), col.names=varName, widths=1+varEND-varIn)
  YEAR AGE GENDER STATUS
1 2005  40      0      d

You should get further efficiency if you specify colClasses, but my attempt to demonstrate this failed to show a difference. Perhaps that advice only applies to read.table and its cousins.
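For reference, a sketch of how colClasses could be passed: read.fwf forwards extra arguments to read.table, so colClasses can be given directly in the call. The sample records are made up, and this only shows the call, not whether it is any faster.

```r
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEND  <- c(4, 6, 7, 8)

# Made-up sample records; a file name could be passed to read.fwf instead.
dat <- c("2005400d", "2006311m", "2007250s")

con <- textConnection(dat)
# colClasses is passed through to read.table; "character" is recycled to all columns.
parsed <- read.fwf(con, widths = 1 + varEND - varIn, col.names = varName,
                   colClasses = "character")
close(con)

str(parsed)
```

Reading everything as character skips type guessing; the numeric columns can be converted afterwards if needed.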
