How can I simplify this code (R) in which I am using information from an original data set to create a new dataset?

I have a data set that I am trying to use to generate a different data set in R. The dataset has many columns, but the three relevant columns for generating the new data set are "Reach", "Results", and "DV". Reach and Results are numeric. DV is binary (0s and 1s). In the original dataset, all rows have DV = 0.

For each row of the original data set, I want to replicate that row "Reach" number of times. Then, within this new set of rows, I want to change DV from 0 to 1 for "Results" number of rows (using the Results value from the original row).

For example, in row 33 of the original data set: Reach = 1004, Results = 45, DV = 0. The new data set should have row 33 replicated 1004 times, and for 45 of those new rows DV should be changed from 0 to 1.

The code I wrote for the task works, but it takes 10+ hours to run because the file is so large. Any ideas for how to simplify this code so it runs more quickly?

empty_new.video <- new.video[FALSE,]
for(i in 1:nrow(new.video)){
  n.times <- new.video[i,'Reach'] #determine number of times to repeat rows
  if (n.times > 0){
    for (j in 1:n.times){
      empty_new.video[nrow(empty_new.video) + 1 , ] <- new.video[i,]
    }
  }
  dv.times <- new.video[i,'Results'] #creating dependent variable 
  if (dv.times>0){
    for (k in 1:dv.times){
      empty_new.video[nrow(empty_new.video) - n.times + k,'DV'] <- 1
    }
  }
}

Avoid growing objects in a loop. Consider Map (a wrapper to mapply) to iterate elementwise through the original dataset's columns, building a list of data frames that is concatenated once at the end.

build_rows <- function(reach, results) {
    # DATA FRAME WITH ONE ROW PER UNIT OF REACH (seq_len HANDLES REACH = 0)
    df <- data.frame(id = rep(reach, reach), reach = seq_len(reach), dv = rep(0, reach))

    # USE ONE OF THE TWO OPTIONS BELOW, NOT BOTH

    # OPTION 1: RANDOMLY ASSIGN N ROWS TO 1 (N=RESULTS)
    # df$dv[sample(seq_len(nrow(df)), results)] <- 1

    # OPTION 2: ASSIGN FIRST N ROWS TO 1 (N=RESULTS)
    df$dv[seq_len(results)] <- 1

    return(df)
}

df_list <- Map(build_rows, original_data$Reach, original_data$Results)

final_df <- do.call(rbind, df_list)
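
To illustrate why growing a data frame inside a loop is costly, here is a minimal sketch that compares the two patterns on made-up data. The names toy, grow_in_loop, and bind_once are hypothetical and not part of the original post, and actual timings will depend on the machine and the data size.

```r
# Toy data (assumed for illustration only)
toy <- data.frame(Reach = c(300, 200, 100), Results = c(30, 20, 10), DV = 0)

# Pattern 1: append one row at a time, so the data frame is modified on every append
grow_in_loop <- function(d) {
  out <- d[FALSE, ]
  for (i in seq_len(nrow(d))) {
    for (j in seq_len(d$Reach[i])) {
      out[nrow(out) + 1, ] <- d[i, ]
    }
  }
  out
}

# Pattern 2: build each replicated piece, then bind all pieces once at the end
bind_once <- function(d) {
  pieces <- lapply(seq_len(nrow(d)), function(i) {
    d[rep(i, d$Reach[i]), ]
  })
  do.call(rbind, pieces)
}

t_grow <- system.time(a <- grow_in_loop(toy))
t_bind <- system.time(b <- bind_once(toy))

# Compare elapsed times; both results have sum(toy$Reach) rows
rbind(grow = t_grow, bind = t_bind)
c(nrow(a), nrow(b))
```

The gap between the two timings tends to widen as the number of output rows grows, which is consistent with the 10+ hour runtime described in the question.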

Rather than a loop that does everything at once, you could define a simple function that does this for one row and check the results.

dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))
#   Reach Results DV
# 1     5       4  0
# 2     3       1  0

f <- function(data) {
  nr <- data$Reach
  nd <- data$Results
  data <- data[rep_len(1L, nr), ]
  data$DV <- rep(0:1, c(nr - nd, nd))
  rownames(data) <- NULL
  data
}
f(dd[1, ])

Then loop over every row:

res <- lapply(split(dd, rownames(dd)), f)
do.call('rbind', res)
#     Reach Results DV
# 1.1     5       4  0
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 1.5     5       4  1
# 2.1     3       1  0
# 2.2     3       1  0
# 2.3     3       1  1
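
One thing to watch with split on row names: the groups are ordered by the row names treated as text, so with more than nine rows the output order can differ from the input order. The sketch below, on made-up data, compares that with splitting on an integer index. The names dd12 and expand_one are hypothetical; expand_one repeats the per-row logic above so the block runs on its own.

```r
# Toy data with more than 9 rows (assumed for illustration)
dd12 <- data.frame(id = 1:12, Reach = rep(2, 12), Results = rep(1, 12), DV = 0)

# Per-row helper: replicate the row Reach times, with the last Results rows set to 1
expand_one <- function(data) {
  nr <- data$Reach
  nd <- data$Results
  data <- data[rep_len(1L, nr), ]
  data$DV <- rep(0:1, c(nr - nd, nd))
  rownames(data) <- NULL
  data
}

# Row names are character, so the groups are sorted as text ("1", "10", "11", ...)
by_name <- do.call(rbind, lapply(split(dd12, rownames(dd12)), expand_one))

# An integer index keeps the groups in their original numeric order
by_index <- do.call(rbind, lapply(split(dd12, seq_len(nrow(dd12))), expand_one))

unique(by_name$id)   # order follows the sorted row-name strings
unique(by_index$id)  # order follows the original rows
```

If the row order of the result matters, splitting on seq_len(nrow(...)) is the safer choice.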

But really all you are doing is creating a vector of row indices and a vector of 0/1 values for DV; you could do that with rep:

ii <- rep(1:nrow(dd), dd$Reach)

jj <- c(t(cbind(dd$Reach - dd$Results, dd$Results)))
dv <- rep(rep(0:1, nrow(dd)), jj)

within(dd[ii, ], {
  DV <- dv
})
#     Reach Results DV
# 1       5       4  0
# 1.1     5       4  1
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 2       3       1  0
# 2.1     3       1  0
# 2.2     3       1  1
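
The rep approach can be wrapped in a small reusable function. This is only a sketch: the name expand_rows and its random argument are hypothetical, and it assumes Results never exceeds Reach, which it checks up front. Rows with Reach = 0 simply contribute no output rows.

```r
# Hypothetical helper: repeat each row Reach times and set DV = 1 on Results of them
expand_rows <- function(d, random = FALSE) {
  # Assumption: 0 <= Results <= Reach for every row
  stopifnot(all(d$Results >= 0), all(d$Results <= d$Reach))

  # Row index repeated Reach times per original row
  ii <- rep(seq_len(nrow(d)), d$Reach)

  # Per row: (Reach - Results) zeros followed by Results ones
  counts <- c(rbind(d$Reach - d$Results, d$Results))
  dv <- rep(rep(0:1, nrow(d)), counts)

  if (random && length(dv) > 0) {
    # Shuffle the 0/1 values within each original row
    dv <- ave(dv, ii, FUN = function(x) x[sample.int(length(x))])
  }

  out <- d[ii, , drop = FALSE]
  out$DV <- dv
  rownames(out) <- NULL
  out
}

dd <- data.frame(Reach = c(5, 3, 0), Results = c(4, 1, 0), DV = 0)
res <- expand_rows(dd)                  # ones placed last within each group
set.seed(1)
res_random <- expand_rows(dd, TRUE)     # ones placed at random within each group
```

Because everything is done with vectorized rep and a single subsetting step, no object is grown inside a loop.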
