How can I optimize and speed up a for loop when handling a large dataset in R?

I'm currently working on a data transformation. The data is not very large, about 190k rows.

I wrote a for loop like this:

for (i in 1:nrow(df2)){
#a
record.a <- df[which(df$first_lat==df2[i,"third_lat"] 
            & df$first_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"fourth_lat"] 
            & df$sixth_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,18] <- ifelse(nrow(record.a) != 0,record.a$order_cnt,NA)

#b
record.b <- df[which(df$fifth_lat==df2[i,"third_lat"] 
            & df$fifth_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"second_lat"] 
            & df$sixth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,19] <- ifelse(nrow(record.b) != 0,record.b$order_cnt,NA)

#c
record.c <- df[which(df$fifth_lat==df2[i,"first_lat"] 
            & df$fifth_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"second_lat"] 
            & df$fourth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,20] <- ifelse(nrow(record.c) != 0,record.c$order_cnt,NA)

#d
record.d <- df[which(df$third_lat==df2[i,"first_lat"] 
            & df$third_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"sixth_lat"] 
            & df$fourth_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,21] <- ifelse(nrow(record.d) != 0,record.d$order_cnt,NA)

#e
record.e <- df[which(df$third_lat==df2[i,"fifth_lat"] 
            & df$third_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"sixth_lat"] 
            & df$second_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,22] <- ifelse(nrow(record.e) != 0,record.e$order_cnt,NA)

#f
record.f <- df[which(df$first_lat==df2[i,"fifth_lat"] 
            & df$first_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"fourth_lat"] 
            & df$second_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,23] <- ifelse(nrow(record.f) != 0,record.f$order_cnt,NA)
}

So, basically, I need to fill 6 columns of df2 from df, one column per matching criterion. In the for loop, nrow(df2) is about 190k, and it runs very slowly. I used View(df2) to check the result, and the values it fills in look correct. Is there any way to make it faster? I may apply the same data transformation to a much larger dataset in the future.

df: [screenshot of df]

df2: [screenshot of df2]

The data describes grids on a map. df2 is basically a subset of df with 6 additional columns. Both df and df2 have the same lon and lat information.

Each grid_id stands for a hexagonal area on a map. Each hexagon is connected to six other hexagons, and each connection is identified by two pairs of lon and lat. What I want to do is find a particular value from each of the six surrounding hexagons (in df) and fill it into columns (a, b, c, d, e, f) in df2. I also need two other conditions, on hours and ten_mins_interval: (df[,4]==df2[i,4] & df[,3]==df2[i,5]).

So I think the logic is:

  1. For each grid_id, hours, ten_mins_interval (1 row) in df2
  2. find the corresponding 6 grid_ids (6 rows) with same hours, ten_mins_interval in df
  3. fill order_cnt from those 6 rows into a,b,c,d,e,f columns in df2
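The three steps above amount to a keyed lookup, so they can be sketched without a loop. What follows is a minimal sketch for column a only, on made-up toy data. The column names hours and ten_mins_interval are assumptions taken from the description (the loop refers to those columns only by position), and lookup_order_cnt is a hypothetical helper name.

```r
# Toy stand-in for df: one row per grid cell and time slot (values are made up).
df <- data.frame(
  first_lat = c(1, 2, 3),
  first_lon = c(10, 20, 30),
  sixth_lat = c(1.5, 2.5, 3.5),
  sixth_lon = c(10.5, 20.5, 30.5),
  hours = c(8, 8, 9),
  ten_mins_interval = c(1, 1, 2),
  order_cnt = c(5, 7, 9)
)

# Toy stand-in for df2: its (third, fourth) edge should equal the
# (first, sixth) edge of the neighboring cell in df, as in block #a of the loop.
df2 <- data.frame(
  third_lat = c(2, 1, 9),
  third_lon = c(20, 10, 90),
  fourth_lat = c(2.5, 1.5, 9.5),
  fourth_lon = c(20.5, 10.5, 90.5),
  hours = c(8, 8, 9),
  ten_mins_interval = c(1, 1, 2)
)

# Hypothetical helper: paste the key columns of each table into one string
# per row, then use match() to find, for every row of df2, the first row of
# df with the same key. Rows without a match get NA.
lookup_order_cnt <- function(df, df2, df_keys, df2_keys) {
  key_df  <- do.call(paste, c(unname(as.list(df[df_keys])),  sep = "|"))
  key_df2 <- do.call(paste, c(unname(as.list(df2[df2_keys])), sep = "|"))
  df$order_cnt[match(key_df2, key_df)]
}

df2$a <- lookup_order_cnt(
  df, df2,
  df_keys  = c("first_lat", "first_lon", "sixth_lat", "sixth_lon",
               "hours", "ten_mins_interval"),
  df2_keys = c("third_lat", "third_lon", "fourth_lat", "fourth_lon",
               "hours", "ten_mins_interval")
)
```

Like the loop, this takes the first matching row of df when several match. One caveat: paste() converts numbers to text, so two coordinates that differ only beyond about 15 significant digits would be treated as equal, whereas == in the loop compares them exactly.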

If you start with your current df2[,1:17], you can add the column that your loop writes to df2[,18] with the merge command:

df2 <- merge(df[,c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name","order_cnt")],
      df2,
      by.x=c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name"),
      by.y=c("third_lat","third_lon","fourth_lat","fourth_lon","col4name","col5name"),
      all.y=TRUE)

You'll need to replace col4name with the name of the fourth column and so on (the by.x names refer to columns of df, the by.y names to columns of df2). I can't see from the screenshot what those names might be. Five more versions of this command can easily be generated to add the other five columns. As the operation works on whole vectors at a time, it's likely to be faster than looping. As the data wasn't provided in a suitable format, this isn't tested.
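One way to generate all six merges is to drive them from a small table of corner pairs copied from the loop in the question. This is only a sketch under assumptions: the time columns are assumed to be called hours and ten_mins_interval in both data frames, coord_cols and add_neighbor_counts are hypothetical helper names, and the toy data is made up. Because merge() reorders rows and moves the key columns to the front, the sketch renames the lookup keys to df2's names, tracks the original row order, and restores the column order afterwards.

```r
# Corner pairs from the loop: (df corner 1, df corner 2, df2 corner 1, df2 corner 2).
specs <- list(
  a = c("first", "sixth",  "third", "fourth"),
  b = c("fifth", "sixth",  "third", "second"),
  c = c("fifth", "fourth", "first", "second"),
  d = c("third", "fourth", "first", "sixth"),
  e = c("third", "second", "fifth", "sixth"),
  f = c("first", "second", "fifth", "fourth")
)

# c("first", "sixth") -> "first_lat", "first_lon", "sixth_lat", "sixth_lon"
coord_cols <- function(corners) {
  as.vector(rbind(paste0(corners, "_lat"), paste0(corners, "_lon")))
}

add_neighbor_counts <- function(df, df2, specs,
                                time_cols = c("hours", "ten_mins_interval")) {
  # Drop any existing a..f columns so merge() does not add .x/.y suffixes.
  df2 <- df2[setdiff(names(df2), names(specs))]
  orig_names <- names(df2)
  df2$.row <- seq_len(nrow(df2))  # remember the original row order
  for (nm in names(specs)) {
    s <- specs[[nm]]
    df_keys  <- c(coord_cols(s[1:2]), time_cols)
    df2_keys <- c(coord_cols(s[3:4]), time_cols)
    # Slim lookup table: keys renamed to df2's names, value renamed to a..f.
    lookup <- unique(df[, c(df_keys, "order_cnt")])
    names(lookup) <- c(df2_keys, nm)
    df2 <- merge(df2, lookup, by = df2_keys, all.x = TRUE)
  }
  df2 <- df2[order(df2$.row), c(orig_names, names(specs))]
  rownames(df2) <- NULL
  df2
}

# Toy data: two cells whose corners are arbitrary distinct numbers.
corners <- c("first", "second", "third", "fourth", "fifth", "sixth")
df <- data.frame(grid_id = c("g1", "g2"), hours = c(8, 8),
                 ten_mins_interval = c(1, 1), order_cnt = c(5, 7))
for (k in seq_along(corners)) {
  df[[paste0(corners[k], "_lat")]] <- c(k, 100 + k)
  df[[paste0(corners[k], "_lon")]] <- c(k + 0.5, 100.5 + k)
}
# Make g2 the "a" neighbor of g1: g2's (first, sixth) edge is g1's (third, fourth) edge.
df[2, coord_cols(c("first", "sixth"))] <- df[1, coord_cols(c("third", "fourth"))]

df2 <- df
res <- add_neighbor_counts(df, df2, specs)
```

In this toy example g1 picks up g2's order_cnt in column a, and g2 picks up g1's order_cnt in column d, since the shared edge is seen from the other side. Note that if df holds several rows with the same key but different order_cnt values, merge() would duplicate rows of df2, unlike the loop, which keeps only the first match.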
