Why is a tibble slower than a data.frame in row-wise comparisons?

Question

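I am converting a legacy code base to the tidyverse and noticed a performance drop in one particular step. Because I now read my data with readr (read_delim), I end up with a tibble instead of the base R data.frame I used to get from read.delim, which is fine in itself.

In short, row-wise comparisons on a tibble take roughly 10 times longer than on a regular data.frame.

Here is my code: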

library(tidyverse)

# Data
df <- tribble(
  ~x_pos, ~y_pos,
  0.0,  5.0,
  NA,   NA,
  0.1,  0.9,
  1.1,  1.5,
  1.7,  2.0,
  3.2,  1.0,
  4.0,  1.5,
  4.1,  5.0,
)

# Defining Regions of interest
roi_set_top <- list(
  roi_list = list(
    roi1 = list(
      hit_name = "left",
      x1 = 1.0,
      y1 = 1.0,
      x2 = 2.0,
      y2 = 2.0
    ),
    roi2 = list(
      hit_name = "right",
      x1 = 3.0,
      y1 = 1.0,
      x2 = 4.0,
      y2 = 2.0
    )
  )
)

# ⚡️ Uncomment the next line to convert the `tibble` to a `data.frame`, then source the file again
# df <- as.data.frame(df)

start.time <- Sys.time()

for (bench in 1:1000) {
  roi_vector <- rep("NO EVAL", times = nrow(df))
  
  # loop over rows
  for (i in 1:nrow(df)) {
    
    # loop over the ROI list
    for (roi in roi_set_top$roi_list) {
      
      # if x or y (or both) is NA, mark the row as "No X/Y" and stop checking ROIs
      if (is.na(df[i, "x_pos"]) || is.na(df[i, "y_pos"])) {
        roi_vector[i] <- "No X/Y"
        break
      }
      
      # check the hit area
      if (df[i, "x_pos"] >= roi$x1 && df[i, "y_pos"] >= roi$y1 &&
          df[i, "x_pos"] <= roi$x2 && df[i, "y_pos"] <= roi$y2) {
        roi_vector[i] <- roi$hit_name
        break
      }
      
      # Finally, if current row’s x and y is neither NA nor in hit range assign Outside ROI
      roi_vector[i] <- "Outside ROI"
    }
  }
}

end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)

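Comparison

Sourcing the code as is takes about 10 times longer than sourcing it after uncommenting the ⚡️ line, which converts the tibble to a data.frame.

I can recover the performance by extracting the columns as vectors, x_pos <- df$x_pos; y_pos <- df$y_pos, and using those vectors instead of df inside the loop. That still leaves me with a more basic question.

The question

Why is a tibble slower than a base R data.frame in row-wise comparisons?

As a follow-up on best practice: using df when only the vectors are needed seems like bad style. Should one therefore always iterate over vectors rather than over the columns of a df?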

Answer 1

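The main reason is that tibbles return tibbles when subset, whereas data frames sometimes return vectors. In your example this shows up when evaluating df[i, "x_pos"]: if df is a tibble the result is a one-row, one-column tibble, but if df is a data frame it is a numeric scalar. This makes calls like is.na(df[i, "x_pos"]) much slower.

You get a speedup by adding drop = TRUE every time you really want a vector or scalar (I saw the time taken drop by 25%), but a better idea is to convert the columns to vectors outside the loop and avoid all those individual accesses into the tibble. For example: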
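The difference in return types can be checked directly. The following is a minimal sketch, assuming the tibble package is installed; the object names tb, base_df, from_tibble, from_df and from_drop are made up for this illustration and do not appear in the question.

```r
library(tibble)

# Small stand-in for the question's data (hypothetical object names)
tb <- tibble(x_pos = c(0.0, NA, 1.1), y_pos = c(5.0, NA, 1.5))
base_df <- as.data.frame(tb)

# Single-cell subsetting of a tibble keeps the tibble wrapper
from_tibble <- tb[1, "x_pos"]
class(from_tibble)

# The same call on a base data.frame drops to a bare numeric value
from_df <- base_df[1, "x_pos"]
class(from_df)

# Asking for drop = TRUE, or indexing the column vector, also yields a bare value
from_drop <- tb[1, "x_pos", drop = TRUE]
from_column <- tb$x_pos[1]
```

Every later operation on from_tibble (is.na, >=, <=) has to work on a one-cell tibble rather than on a plain number, which is where the extra time per iteration goes.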

start.time <- Sys.time()

for (bench in 1:1000) {
  roi_vector <- rep("NO EVAL", times = nrow(df))
  # extract the columns once as plain vectors
  x_pos <- df$x_pos
  y_pos <- df$y_pos
  # loop over rows
  for (i in 1:nrow(df)) {
    # loop over the ROI list
    for (roi in roi_set_top$roi_list) {
      # if x or y (or both) is NA, mark the row as "No X/Y" and stop checking ROIs
      if (is.na(x_pos[i]) || is.na(y_pos[i])) {
        roi_vector[i] <- "No X/Y"
        break
      }
      # check the hit area
      if (x_pos[i] >= roi$x1 && y_pos[i] >= roi$y1 &&
          x_pos[i] <= roi$x2 && y_pos[i] <= roi$y2) {
        roi_vector[i] <- roi$hit_name
        break
      }
      # Finally, if current row’s x and y is neither NA nor in hit range assign Outside ROI
      roi_vector[i] <- "Outside ROI"
    }
  }
}
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)

This version is 60 times faster than the original code on my system.
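Once the columns are plain vectors, the row loop can also be dropped altogether. The sketch below is not part of the original answer; it swaps in vectorized comparisons, uses base R only, and the function name classify_roi is hypothetical. It assumes the ROI list is non-empty and that, as in the loop version, the first matching ROI should win.

```r
# Hypothetical helper: classify every point against a list of rectangular ROIs
classify_roi <- function(x_pos, y_pos, roi_list) {
  result <- rep("Outside ROI", length(x_pos))
  missing_xy <- is.na(x_pos) | is.na(y_pos)

  # Walk the ROIs in reverse so that the first ROI in the list wins on overlap
  for (roi in rev(roi_list)) {
    hit <- !missing_xy &
      x_pos >= roi$x1 & x_pos <= roi$x2 &
      y_pos >= roi$y1 & y_pos <= roi$y2
    result[hit] <- roi$hit_name
  }

  # Rows with a missing coordinate are flagged regardless of the ROIs
  result[missing_xy] <- "No X/Y"
  result
}

# Same data and ROIs as in the question, written as plain vectors and lists
x_pos <- c(0.0, NA, 0.1, 1.1, 1.7, 3.2, 4.0, 4.1)
y_pos <- c(5.0, NA, 0.9, 1.5, 2.0, 1.0, 1.5, 5.0)
roi_list <- list(
  roi1 = list(hit_name = "left",  x1 = 1.0, y1 = 1.0, x2 = 2.0, y2 = 2.0),
  roi2 = list(hit_name = "right", x1 = 3.0, y1 = 1.0, x2 = 4.0, y2 = 2.0)
)

roi_vector <- classify_roi(x_pos, y_pos, roi_list)
print(roi_vector)
```

With a tibble or data.frame, the call would be classify_roi(df$x_pos, df$y_pos, roi_set_top$roi_list), so the choice of container no longer affects the inner work.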

Solution 1
1 accepted, 2021-10-24 12:50:48