简体   繁体   中英

Multiple loops with data.table in R

I try to deal with multiple loops with data.table for quite some time and got frustrated. With sql it seamed quite intuitive, but with RI experience some problems.

For example, I would want to fread one txt file (as I have hundreds of files, each around 1 GB), make the calculations (sum price and quant, when time>my.time and for some selected isin's, grouped by my.time, isin and price), write the results to some csv file, remove original txt file from R memory; then redo these calculations one-by-one for all txt files and append the output csv file.

Lets start with the example data (quite small, just two identical files for illustration):

        time<-format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date()+1), by = "1 sec"),"%H:%M:%S")
        n<-length(time)
        isin<-paste("US",1:n,sep="")
        price<-rnorm(n,101,1)
        quant<-rnorm(n,5,1)
        dt<-data.table(time,isin,price,quant)
        write.table(dt,"raw.txt",append = FALSE,sep = ",",col.names = TRUE, row.names = FALSE)
        write.table(dt,"raw2.txt",append = FALSE,sep = ",",col.names = TRUE, row.names = FALSE)

    my.files <- list.files(pattern = "raw*.txt")
    my.time<-format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date()+1), by = "5 min"),"%H:%M:%S")
    my.isin<-c("US100","US150","US225","US250","US1050")

Then I try these 2 simple loops:

       for (i in my.files){
              for (j in my.time){
              dt<-fread(i)
        write.table(dt[which(isin %in% my.isin & time>j),
           .(sprice=sum(price),squant=sum(quant),**time.my=j**), by = .(isin,price)],
           "output.csv",append = TRUE,sep = ",",col.names = TRUE)
        rm(dt) 
        }}

Second edit: The loop with j finally started working for me (due to the bolded part). Maybe it would be possible to work without for loops and get the same results?

Thanks very much for help!

The problem you have is the output of your which statement is returning zero rows. First I would convert your time to time type. I've then created a 5 minute grouping variable.

This would first aggregate your tables.

dt[,`:=`(time= as.ITime(strptime(time, format="%H:%M:%S")))]
dt[,`:=`(time5 = format(strptime("1970-01-01", "%Y-%m-%d", tz="UTC") + 
                          round(as.numeric(time)/300)*300,"%H:%M"))]

dt[, list(sprice = sum(price),squant= sum(quant)),by = c("time5","price","isin")][isin %in% my.isin]


#    time5     price    isin    sprice   squant
# 1: 00:00 102.46668     US1 102.46668 3.002960
# 2: 00:00  99.02186     US2  99.02186 5.253252
# 3: 00:00 100.23665     US3 100.23665 6.153950
# 4: 00:00 102.21466     US4 102.21466 3.461051
# 5: 00:00 100.97890     US5 100.97890 5.893336

you can then filter it by your my.isn or time5 greater than custom times?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM