How can I simplify this code (R) in which I am using information from an original data set to create a new dataset?
I have a data set that I am trying to use to generate a different data set in R. The dataset has many columns, but the three relevant columns for generating the new data set are "Reach", "Results", and "DV". Reach and Results are numeric. DV is binary with 0s and 1s. In the original dataset, all rows have DV = 0.
For each row of the original data set, I want to replicate that row "Reach" number of times. Then, within this new set of rows, I want to change DV from 0 to 1 for "Results" number of rows (using the values from the original row).
For example, in row 33 of the original data set: Reach = 1004, Results = 45, DV = 0. The new data set should have row 33 replicated 1004 times, and for 45 of those new rows DV should be changed from 0 to 1.
The code I wrote for the task works, but it is taking 10+ hours to run because the file is so large. Any ideas for how to simplify this code so it can process more quickly?
empty_new.video <- new.video[FALSE,]
for(i in 1:nrow(new.video)){
  n.times <- new.video[i,'Reach'] #determine number of times to repeat rows
  if (n.times > 0){
    for (j in 1:n.times){
      empty_new.video[nrow(empty_new.video) + 1 , ] <- new.video[i,]
    }
  }
  dv.times <- new.video[i,'Results'] #creating dependent variable
  if (dv.times>0){
    for (k in 1:dv.times){
      empty_new.video[nrow(empty_new.video) - n.times + k,'DV'] <- 1
    }
  }
}
Avoid growing objects in a loop. Consider Map (a wrapper around mapply) to iterate elementwise through the original dataset's columns, build a list of data frames, and concatenate them once at the end.
build_rows <- function(reach, results) {
  # DATA FRAME TO REPLICATE REACH BY ITS LENGTH (rep/seq_len ALSO HANDLE REACH = 0)
  df <- data.frame(id = rep(reach, reach), reach = seq_len(reach), dv = rep(0, reach))
  # OPTION A: RANDOMLY ASSIGN N ROWS TO 1 (N=RESULTS)
  df$dv[sample(nrow(df), results)] <- 1
  # OPTION B: ASSIGN FIRST N ROWS TO 1 (N=RESULTS); USE INSTEAD OF OPTION A
  # df$dv[seq_len(results)] <- 1
  return(df)
}
df_list <- Map(build_rows, original_data$Reach, original_data$Results)
final_df <- do.call(rbind, df_list)
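Note that build_rows above only returns id, reach and dv, not the other columns of the original row. If those are needed, one possible variant is sketched below. It is only an illustration: expand_row, the toy original_data values and the choice to flag the first Results rows are assumptions, not part of the answer above.

```r
# Toy stand-in for the original data (hypothetical values)
original_data <- data.frame(Reach = c(5, 3, 0), Results = c(4, 1, 0), DV = 0)

# Hypothetical helper: expand row i into Reach[i] copies of the full row,
# then set DV = 1 on the first Results[i] copies
expand_row <- function(i, data) {
  n <- data$Reach[i]
  k <- data$Results[i]
  out <- data[rep(i, n), , drop = FALSE]  # zero rows when n is 0
  out$DV <- rep(c(1, 0), c(k, n - k))
  out
}

# Build every piece first, then bind once at the end (no growing inside a loop)
pieces <- lapply(seq_len(nrow(original_data)), expand_row, data = original_data)
expanded <- do.call(rbind, pieces)
rownames(expanded) <- NULL
```

Here nrow(expanded) should equal sum(original_data$Reach) and sum(expanded$DV) should equal sum(original_data$Results), which makes a quick sanity check before running on the full file.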
Rather than a loop that does everything at once, you could define a simple function that does this for one row and check the results:
dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))
#   Reach Results DV
# 1     5       4  0
# 2     3       1  0
f <- function(data) {
  nr <- data$Reach
  nd <- data$Results
  data <- data[rep_len(1L, nr), ]
  data$DV <- rep(0:1, c(nr - nd, nd))
  rownames(data) <- NULL
  data
}
f(dd[1, ])
Then loop over every row:
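# split on the row position so the pieces stay in the original row order
# (splitting on rownames(dd) sorts them as text, e.g. "10" before "2")
res <- lapply(split(dd, seq_len(nrow(dd))), f)
do.call('rbind', res)
#     Reach Results DV
# 1.1     5       4  0
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 1.5     5       4  1
# 2.1     3       1  0
# 2.2     3       1  0
# 2.3     3       1  1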
But really all you are doing is creating a vector of row indices and 0/1 values for DV; you could do that with rep:
ii <- rep(1:nrow(dd), dd$Reach)
jj <- c(t(cbind(dd$Reach - dd$Results, dd$Results)))
dv <- rep(rep(0:1, nrow(dd)), jj)
within(dd[ii, ], {
  DV <- dv
})
#     Reach Results DV
# 1       5       4  0
# 1.1     5       4  1
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 2       5       4  0
# 2.1     3       1  0
# 2.2     3       1  1
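For reuse on the full data set, the rep-based idea could be wrapped in a small function with a guard against rows where Results exceeds Reach. This is only a sketch: the name expand_dv and the input check are assumptions added for illustration.

```r
# Hypothetical wrapper around the rep()-based approach
expand_dv <- function(data) {
  # Assumed sanity check: the number of 1s cannot exceed the number of copies
  stopifnot(all(data$Results >= 0), all(data$Results <= data$Reach))
  # Row index repeated Reach times for each original row
  idx <- rep(seq_len(nrow(data)), data$Reach)
  # Interleaved run lengths: zeros for row 1, ones for row 1, zeros for row 2, ...
  counts <- c(rbind(data$Reach - data$Results, data$Results))
  out <- data[idx, , drop = FALSE]
  out$DV <- rep(rep(0:1, nrow(data)), counts)
  rownames(out) <- NULL
  out
}

dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))
expanded <- expand_dv(dd)

# Number of 1s per original row; should match dd$Results
ones_per_row <- tapply(expanded$DV, rep(seq_len(nrow(dd)), dd$Reach), sum)
```

Because everything is built from a handful of vectorised rep calls and a single subsetting step, this avoids both the nested loops and the repeated rbind of many small data frames.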