
R - Data frame append first row

Unfortunately, I got stuck and need your help.

I am initializing a data frame and trying to fill it with new rows in a loop. It almost works as it should, except that the first row gets "NA" as its row.names value. Can anyone propose a solution for this and/or explain why this happens?

I am using the f3 approach from the answer to this question: How to append rows to an R data frame

Example:

df <- data.frame( "Type" = character(), 
                  "AvgError" = numeric(), 
                  "StandardDeviation"= numeric (), 
                  stringsAsFactors=FALSE)

for (i in 1:3){
  df[nrow(df) + 1, ]$Type           <- paste("Test", as.character(format(round(i, 2), nsmall = 2)))
  df[nrow(df), ]$AvgError           <- i/10
  df[nrow(df), ]$StandardDeviation  <- i/100
}

df
        Type AvgError StandardDeviation
NA Test 1.00      0.1              0.01
2  Test 2.00      0.2              0.02
3  Test 3.00      0.3              0.03

If I can provide any more information, please comment and I will try to provide what I can. Thanks for the help.

Edit: OK, thanks for the discussion so far. I understand (and already knew) that this is not the best way to do this, because it is much slower than a functional approach, but execution time is not important in this case. @MrFlick provided a workaround in the comments: just rename the row.names at the end ( rownames(df) <- 1:nrow(df) ). This helps, but it still feels unsatisfying to me since it doesn't treat the cause but only deals with the symptoms.
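A minimal sketch of where the "NA" label appears to come from, and one way to avoid it. Indexing a row that does not exist yet, as df[nrow(df) + 1, ] does, returns a one-row data frame whose row name is the string "NA"; the nested df[...]$Type <- ... assignment seems to hand that intermediate data frame, label included, back to the data frame. Assigning the whole row in one step as a plain list does not go through that intermediate object. The variable name phantom is only illustrative.

```r
# Same empty data frame as in the question.
df <- data.frame(Type = character(),
                 AvgError = numeric(),
                 StandardDeviation = numeric(),
                 stringsAsFactors = FALSE)

# Selecting a row that does not exist yet yields a one-row data frame
# of NAs whose row name is the string "NA".
phantom <- df[nrow(df) + 1, ]
print(rownames(phantom))

# Assign each new row in a single step as a plain list, so no
# intermediate one-row data frame (and its row name) is involved.
for (i in 1:3) {
  df[nrow(df) + 1, ] <- list(
    paste("Test", as.character(format(round(i, 2), nsmall = 2))),
    i / 10,
    i / 100
  )
}

print(rownames(df))
print(df)
```

This still grows the data frame one row at a time, so it addresses only the row names, not the speed.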

Growing data frames by appending one row at a time makes your code inefficient because R has to keep reallocating the entire space for your data frame at each iteration. Especially as the object grows large, this can make your code quite slow. You can read all about this issue in Circle 2 of The R Inferno.

As an example, consider your code versus similar code that computes each row of the data frame separately and then combines them at the end with do.call and rbind:

OP <- function(vals) {
  df <- data.frame( "Type" = character(), 
                    "AvgError" = numeric(), 
                    "StandardDeviation"= numeric (), 
                    stringsAsFactors=FALSE)
  for (i in vals){
    df[nrow(df) + 1, ]$Type           <- paste("Test", as.character(format(round(i, 2), nsmall = 2)))
    df[nrow(df), ]$AvgError           <- i/10
    df[nrow(df), ]$StandardDeviation  <- i/100
  }
  row.names(df) <- vals
  df
}

josilber <- function(vals) {
  ret <- do.call(rbind, lapply(vals, function(x) {
    data.frame(Type=paste("Test", as.character(format(round(x, 2), nsmall = 2))),
               AvgError = x/10,
               StandardDeviation = x/100,
               stringsAsFactors=FALSE)
  }))
  ret
}

all.equal(OP(1:10000), josilber(1:10000))
# [1] TRUE
system.time(OP(1:10000))
#    user  system elapsed 
#  17.849   1.325  19.147 
system.time(josilber(1:10000))
#    user  system elapsed 
#   4.685   0.027   4.713 

For a data frame with 10,000 rows, the code that waits until the end to combine the rows is about 4 times faster than the code that continuously appends to the data frame. Basically, you've introduced about 14 seconds of delay for memory reallocation that has nothing to do with the per-row computation, and that's only for a data frame with 10,000 rows. The wasted computation grows to about 64 seconds for a data frame with 20,000 rows:

system.time(OP(1:20000))
#    user  system elapsed 
#  70.755   7.065  77.717 
system.time(josilber(1:20000))
#    user  system elapsed 
#  12.502   0.968  13.470 
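If the row-by-row loop has to stay, another way to avoid the repeated reallocation is to preallocate one vector per column and build the data frame once at the end. The following is only a sketch under that assumption: the function name preallocated and its local variable names are made up for illustration, and sprintf stands in for the paste/format/round combination used above. No timings are claimed for it.

```r
# Hypothetical helper: fill preallocated column vectors inside the loop
# and construct the data frame a single time afterwards.
preallocated <- function(vals) {
  n <- length(vals)
  type      <- character(n)  # one slot per row, allocated up front
  avg_error <- numeric(n)
  std_dev   <- numeric(n)
  for (k in seq_along(vals)) {
    x <- vals[k]
    type[k]      <- sprintf("Test %.2f", x)  # two decimals, as in the question
    avg_error[k] <- x / 10
    std_dev[k]   <- x / 100
  }
  data.frame(Type = type,
             AvgError = avg_error,
             StandardDeviation = std_dev,
             stringsAsFactors = FALSE)
}

res <- preallocated(1:5)
print(res)
```

Because the data frame is created in one call, its row names are the default 1 to n and no renaming is needed afterwards.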

As noted in the comments, there are much quicker ways to build these particular data frames (computing each variable in one shot with vectorized functions), but I've limited my function josilber to code that computes each row one by one, to demonstrate that appending can still have significant performance implications.
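For reference, the vectorized construction mentioned above might look like the following sketch. The function name vectorized is hypothetical. It uses sprintf rather than format for the labels, because format applied to a whole vector pads every element to a common width, which would change the strings compared with formatting one value at a time.

```r
# Hypothetical helper: compute every column in one shot, no loop.
vectorized <- function(vals) {
  data.frame(Type = sprintf("Test %.2f", vals),  # element-wise, no padding
             AvgError = vals / 10,
             StandardDeviation = vals / 100,
             stringsAsFactors = FALSE)
}

out <- vectorized(1:10000)
print(head(out, 3))
```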
