How to optimize and speed up a for loop when handling a large dataset in R?

Question

I'm currently working on a data transformation. The data is not very large, about 190k rows.

I wrote a for loop like this:

for (i in 1:nrow(df2)){
#a
record.a <- df[which(df$first_lat==df2[i,"third_lat"] 
            & df$first_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"fourth_lat"] 
            & df$sixth_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,18] <- ifelse(nrow(record.a) != 0,record.a$order_cnt,NA)

#b
record.b <- df[which(df$fifth_lat==df2[i,"third_lat"] 
            & df$fifth_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"second_lat"] 
            & df$sixth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,19] <- ifelse(nrow(record.b) != 0,record.b$order_cnt,NA)

#c
record.c <- df[which(df$fifth_lat==df2[i,"first_lat"] 
            & df$fifth_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"second_lat"] 
            & df$fourth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,20] <- ifelse(nrow(record.c) != 0,record.c$order_cnt,NA)

#d
record.d <- df[which(df$third_lat==df2[i,"first_lat"] 
            & df$third_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"sixth_lat"] 
            & df$fourth_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,21] <- ifelse(nrow(record.d) != 0,record.d$order_cnt,NA)

#e
record.e <- df[which(df$third_lat==df2[i,"fifth_lat"] 
            & df$third_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"sixth_lat"] 
            & df$second_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,22] <- ifelse(nrow(record.e) != 0,record.e$order_cnt,NA)

#f
record.f <- df[which(df$first_lat==df2[i,"fifth_lat"] 
            & df$first_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"fourth_lat"] 
            & df$second_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,23] <- ifelse(nrow(record.f) != 0,record.f$order_cnt,NA)
}

So, basically, I need to fill 6 columns of df2 from df, each with its own matching criteria. In the for loop, nrow(df2) is about 190k, and it runs very slowly. I used View(df2) to check the output and the results look correct. Is there any way to make it faster? I may need to apply the same data transformation to a much larger dataset in the future.

df: [screenshot of df]

df2: [screenshot of df2]

The data describes grid cells on a map. df2 is basically a subset of df with 6 additional columns. Both df and df2 have the same lon and lat information.

Each grid_id stands for a hexagonal area on a map. Each hexagon is connected to six other hexagons, and each connection is identified by two pairs of lon and lat. What I want to do is find a particular value (order_cnt) from the six surrounding hexagons (in df) and fill it into columns (a, b, c, d, e, f) in df2. I also need two other conditions, on hours and ten_mins_interval: (df[,4]==df2[i,4] & df[,3]==df2[i,5]).

So I think the logic is:

For each grid_id, hours, ten_mins_interval (1 row) in df2
find the corresponding 6 grid_ids (6 rows) with same hours, ten_mins_interval in df
fill order_cnt from those 6 rows into a,b,c,d,e,f columns in df2

Answer 1

If you start with your current df2[,1:17] you can add df2[,18] with the merge command:

df2 <- merge(df[,c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name","order_cnt")],
      df2,
      by.x=c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name"),
      by.y=c("third_lat","third_lon","fourth_lat","fourth_lon","col4name","col5name"),
      all.y=TRUE)

You'll need to replace col4name with the name of the fourth column, and so on for the others; I can't tell from the screenshot what those names are. Five more versions of this command can easily be written to add the other five columns. Because the operation works on whole vectors at a time, it's likely to be faster than looping. As the data wasn't provided in a reproducible format, this isn't tested.
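For illustration, here is a minimal sketch of how all six columns might be filled in one pass. It is a variant of the idea above rather than the merge call itself: it uses match() on a pasted key, which keeps the rows of df2 in their original order. The toy data, the spec table, the helper make_key, and the column names hours and ten_mins_interval are all assumptions made for this example. It also assumes the coordinates are stored identically in both data frames, since paste() compares their printed form (up to 15 significant digits).

```r
# Toy data: two hexagons, A and B, observed in one time slot.
# Vertex coordinates are made-up integers; the time-slot column names
# (hours, ten_mins_interval) are assumed, not taken from the post.
vertex <- c("first", "second", "third", "fourth", "fifth", "sixth")
coords <- rbind(A = c(1, 2, 3, 4, 5, 6),
                B = c(3, 7, 8, 9, 10, 4))
df <- data.frame(grid_id = c("A", "B"), hours = 8, ten_mins_interval = 2,
                 order_cnt = c(11, 22), stringsAsFactors = FALSE)
for (j in seq_along(vertex)) {
  df[[paste0(vertex[j], "_lat")]] <- unname(coords[, j])
  df[[paste0(vertex[j], "_lon")]] <- unname(coords[, j])
}
df2 <- df

# One row per output column: which two vertices of df must equal
# which two vertices of df2 (copied from the six blocks of the loop).
spec <- data.frame(
  out  = c("a", "b", "c", "d", "e", "f"),
  df_p = c("first", "fifth", "fifth", "third", "third", "first"),
  df_q = c("sixth", "sixth", "fourth", "fourth", "second", "second"),
  d2_p = c("third", "third", "first", "first", "fifth", "fifth"),
  d2_q = c("fourth", "second", "second", "sixth", "sixth", "fourth"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one string key per row from two vertices
# plus the time slot.
make_key <- function(d, p, q) {
  paste(d[[paste0(p, "_lat")]], d[[paste0(p, "_lon")]],
        d[[paste0(q, "_lat")]], d[[paste0(q, "_lon")]],
        d$hours, d$ten_mins_interval, sep = "|")
}

# match() returns the first matching row of df for every row of df2,
# or NA when there is none, so each column is filled in one vectorized step.
for (k in seq_len(nrow(spec))) {
  idx <- match(make_key(df2, spec$d2_p[k], spec$d2_q[k]),
               make_key(df,  spec$df_p[k], spec$df_q[k]))
  df2[[spec$out[k]]] <- df$order_cnt[idx]
}
```

In this toy setup B is the "a" neighbour of A and A is the "d" neighbour of B, so only those two cells are filled and the rest are NA. The new columns are appended by name rather than written to positions 18 to 23.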

solution1
0 ACCEPTED 2017-06-07 15:32:37