I have a data set in R that I am trying to use to generate a different data set. It has many columns, but the three relevant ones for generating the new data set are "Reach", "Results", and "DV". Reach and Results are numeric. DV is binary (0 or 1). In the original data set, all rows have DV = 0.
For each row of the original data set, I want to replicate that row "Reach" number of times. Then, within this new set of rows, I want to change DV from 0 to 1 for "Results" number of rows (using the value from the original row).
For example, in row 33 of the original data set: Reach = 1004, Results = 45, DV = 0. The new data set should have row 33 replicated 1004 times, and for 45 of those new rows DV should be changed from 0 to 1.
The code I wrote for the task works, but it takes 10+ hours to run because the file is so large. Any ideas for how to simplify this code so it runs more quickly?
empty_new.video <- new.video[FALSE,]
for(i in 1:nrow(new.video)){
  n.times <- new.video[i,'Reach'] #determine number of times to repeat rows
  if (n.times > 0){
    for (j in 1:n.times){
      empty_new.video[nrow(empty_new.video) + 1 , ] <- new.video[i,]
    }
  }
  dv.times <- new.video[i,'Results'] #creating dependent variable
  if (dv.times>0){
    for (k in 1:dv.times){
      empty_new.video[nrow(empty_new.video) - n.times + k,'DV'] <- 1
    }
  }
}
Avoid growing objects in a loop. Consider Map (a wrapper to mapply) to iterate elementwise through the original data set's Reach and Results columns, building a list of data frames that is concatenated once at the end.
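build_rows <- function(reach, results) {
  # DATA FRAME TO REPLICATE REACH BY ITS LENGTH
  # (rep/seq_len ALSO HANDLE reach = 0, WHERE 1:reach WOULD GIVE c(1, 0))
  df <- data.frame(id = rep(reach, reach), reach = seq_len(reach), dv = rep(0, reach))
  # PICK ONE OF THE TWO OPTIONS BELOW. NOTE df$dv IS A VECTOR, SO INDEX WITH [i], NOT [i, ]
  # OPTION 1: RANDOMLY ASSIGN N ROWS TO 1 (N=RESULTS)
  # df$dv[sample.int(nrow(df), results)] <- 1
  # OPTION 2: ASSIGN FIRST N ROWS TO 1 (N=RESULTS)
  df$dv[seq_len(results)] <- 1
  return(df)
}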
df_list <- Map(build_rows, original_data$Reach, original_data$Results)
final_df <- do.call(rbind, df_list)
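The Map approach above returns only the columns built inside the helper, while the question mentions many other columns. One possible way to keep them, sketched here under assumptions, is to pass the row index through Map as well and use it to look the remaining columns up afterwards. The toy data, the Campaign column, and the names build_rows_id, pieces, and expanded are all hypothetical and not part of the original post.

```r
# Toy stand-in for the original data; Campaign is a hypothetical extra column
original_data <- data.frame(Campaign = c("a", "b"),
                            Reach = c(5, 3),
                            Results = c(4, 1),
                            DV = 0)

# Hypothetical helper: one small data frame per original row,
# carrying the row index and the DV values (first `results` copies get 1)
build_rows_id <- function(row_id, reach, results) {
  data.frame(row_id = rep(row_id, reach),
             DV = rep(c(1, 0), c(results, reach - results)))
}

# Map walks the three vectors in parallel and returns a list of data frames
pieces <- Map(build_rows_id,
              seq_len(nrow(original_data)),
              original_data$Reach,
              original_data$Results)

# Concatenate once at the end instead of growing an object inside a loop
expanded <- do.call(rbind, pieces)

# Use row_id to pull the other columns back from the original rows
final_df <- cbind(original_data[expanded$row_id, c("Campaign", "Reach", "Results")],
                  DV = expanded$DV)
rownames(final_df) <- NULL
```

On this toy input, final_df has sum(Reach) = 8 rows, and each original row contributes exactly Results rows with DV equal to 1.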
Rather than using a loop to do everything at once, you could define a simple function that does this for one row and check the result:
dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))
# Reach Results DV
# 1 5 4 0
# 2 3 1 0
f <- function(data) {
  nr <- data$Reach
  nd <- data$Results
  data <- data[rep_len(1L, nr), ]
  data$DV <- rep(0:1, c(nr - nd, nd))
  rownames(data) <- NULL
  data
}
f(dd[1, ])
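Then apply it to every row:
# split on the numeric row index: splitting on rownames(dd) sorts the groups
# as character strings ("1", "10", "11", "2", ...) once there are 10+ rows
res <- lapply(split(dd, seq_len(nrow(dd))), f)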
do.call('rbind', res)
# Reach Results DV
# 1.1 5 4 0
# 1.2 5 4 1
# 1.3 5 4 1
# 1.4 5 4 1
# 1.5 5 4 1
# 2.1 3 1 0
# 2.2 3 1 0
# 2.3 3 1 1
But really all you are doing is creating a vector of row indices and a vector of 0/1 values for DV; you could do that with rep:
ii <- rep(1:nrow(dd), dd$Reach)
jj <- c(t(cbind(dd$Reach - dd$Results, dd$Results)))
dv <- rep(rep(0:1, nrow(dd)), jj)
within(dd[ii, ], {
DV <- dv
})
# Reach Results DV
# 1 5 4 0
# 1.1 5 4 1
# 1.2 5 4 1
# 1.3 5 4 1
# 1.4 5 4 1
# 2 3 1 0
# 2.1 3 1 0
# 2.2 3 1 1
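Before running the rep approach on the full file, it may be worth checking the input and the output on a small sample. The sketch below reuses the toy dd from above, first guards against Results exceeding Reach (which would hand rep a negative count), and then verifies the row total and the number of 1s per original row. It also shows one way to place the 1s at random within each group, as an alternative to putting them last. The names out and shuffle_dv are hypothetical.

```r
dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))

# Guard: rep() cannot repeat a value a negative number of times
stopifnot(all(dd$Results <= dd$Reach))

# Row index of the original row for every expanded row
ii <- rep(seq_len(nrow(dd)), dd$Reach)
out <- dd[ii, ]

# Hypothetical helper: 0/1 values for original row i, in random order.
# Indexing with sample.int() avoids the sample(x) surprise for length-1 x.
shuffle_dv <- function(i) {
  v <- rep(0:1, c(dd$Reach[i] - dd$Results[i], dd$Results[i]))
  v[sample.int(length(v))]
}
out$DV <- unlist(lapply(seq_len(nrow(dd)), shuffle_dv))

# Checks: one expanded row per unit of Reach, and Results ones per original row
# (the tapply check assumes every Reach is at least 1, as in this toy data)
stopifnot(nrow(out) == sum(dd$Reach))
stopifnot(all(tapply(out$DV, ii, sum) == dd$Results))
```

Only the position of the 1s within each group changes between runs; the counts per original row stay fixed, so the checks hold regardless of the random seed.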