How can I simplify this R code, which uses information from an original data set to create a new data set?

I have a data set that I am trying to use to generate a different data set in R. The data set has many columns, but the three relevant columns for generating the new data set are "Reach", "Results", and "DV". Reach and Results are numeric. DV is binary (0 or 1). In the original data set, all rows have DV = 0.

For each row of the original data set, I want to replicate that row "Reach" number of times. Then, within this new set of rows, I want to change DV from 0 to 1 for "Results" number of the new rows (using Results from the original row).

For example, in row 33 of the original data set: Reach = 1004, Results = 45, DV = 0. The new data set should have row 33 replicated 1004 times, and for 45 of those new rows DV should be changed from 0 to 1.

The code I wrote for the task works, but it takes 10+ hours to run because the file is so large. Any ideas for how to simplify this code so that it runs more quickly?

empty_new.video <- new.video[FALSE,]
for(i in 1:nrow(new.video)){
  n.times <- new.video[i,'Reach'] #determine number of times to repeat rows
  if (n.times > 0){
    for (j in 1:n.times){
      empty_new.video[nrow(empty_new.video) + 1 , ] <- new.video[i,]
    }
  }
  dv.times <- new.video[i,'Results'] #creating dependent variable 
  if (dv.times>0){
    for (k in 1:dv.times){
      empty_new.video[nrow(empty_new.video) - n.times + k,'DV'] <- 1
    }
  }
}

Avoid growing objects in a loop. Consider Map (a wrapper around mapply) to iterate elementwise over the original data set's columns, building a list of data frames that you concatenate once at the end.

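build_rows <- function(reach, results) {
    # DATA FRAME TO REPLICATE REACH BY ITS LENGTH
    # (rep() AND seq_len() ALSO HANDLE reach = 0, WHERE 1:reach WOULD GIVE c(1, 0))
    df <- data.frame(id = rep(reach, reach), reach = seq_len(reach), dv = rep(0, reach))

    # USE ONE OF THE TWO OPTIONS BELOW, NOT BOTH
    # OPTION A: RANDOMLY ASSIGN N ROWS TO 1 (N=RESULTS)
    # df$dv[sample.int(nrow(df), results)] <- 1

    # OPTION B: ASSIGN FIRST N ROWS TO 1 (N=RESULTS)
    # (dv IS A VECTOR, SO IT IS INDEXED WITHOUT A TRAILING COMMA)
    df$dv[seq_len(results)] <- 1

    return(df)
}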

df_list <- Map(build_rows, original_data$Reach, original_data$Results)

final_df <- do.call(rbind, df_list)
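
A quick way to gain confidence in this approach is to run it on a tiny input and check the totals. The sketch below is self-contained: the toy original_data values and the helper name build_rows_sketch are made up for illustration, and it assumes Results never exceeds Reach. It uses the "first N rows" variant and includes a row with Reach = 0 to exercise that edge case.

```r
# Toy stand-in for the original data (hypothetical values)
original_data <- data.frame(Reach = c(5, 3, 0), Results = c(4, 1, 0))

# Hypothetical helper: builds the expanded rows for one original row
build_rows_sketch <- function(reach, results) {
  # 'results' ones followed by 'reach - results' zeros
  dv <- rep(c(1, 0), c(results, reach - results))
  data.frame(id = rep(reach, reach), reach = seq_len(reach), dv = dv)
}

# One data frame per original row, bound together once at the end
df_list <- Map(build_rows_sketch, original_data$Reach, original_data$Results)
final_df <- do.call(rbind, df_list)

# The expanded data should have sum(Reach) rows and sum(Results) ones
stopifnot(nrow(final_df) == sum(original_data$Reach))
stopifnot(sum(final_df$dv) == sum(original_data$Results))
```

If either check fails on the real data, a likely cause is a row where Results is larger than Reach.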

Rather than writing a loop that does everything at once, you could define a simple function that does this for one row and check the results:

dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))
#   Reach Results DV
# 1     5       4  0
# 2     3       1  0

f <- function(data) {
  nr <- data$Reach
  nd <- data$Results
  data <- data[rep_len(1L, nr), ]
  data$DV <- rep(0:1, c(nr - nd, nd))
  rownames(data) <- NULL
  data
}
f(dd[1, ])

Then apply it to every row:

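# split on the row position rather than rownames(dd): character row names sort as
# "1", "10", "11", "2", ..., which would reorder the rows once there are 10 or more
res <- lapply(split(dd, seq_len(nrow(dd))), f)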
do.call('rbind', res)
#     Reach Results DV
# 1.1     5       4  0
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 1.5     5       4  1
# 2.1     3       1  0
# 2.2     3       1  0
# 2.3     3       1  1

But really all you are doing is creating a vector of row indices and a vector of 0/1 values for DV; you can build both with rep:

ii <- rep(1:nrow(dd), dd$Reach)

jj <- c(t(cbind(dd$Reach - dd$Results, dd$Results)))
dv <- rep(rep(0:1, nrow(dd)), jj)

within(dd[ii, ], {
  DV <- dv
})
#     Reach Results DV
# 1       5       4  0
# 1.1     5       4  1
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 2       3       1  0
# 2.1     3       1  0
# 2.2     3       1  1
