Indexing a data frame takes too long

Question

I have some code that looks like this:

library(dplyr)    # provides tibble(), rowwise() and mutate()
library(stringi)  # provides stri_rand_strings()

df_values <- data.frame(value = stri_rand_strings(n = 500,
                                                  length = 30))

df_keys <- tibble(key = sample(x = 1:500,
                               size = 25000,
                               replace = TRUE))

# start timer
start_time <- Sys.time()

df_keys |>
 rowwise() |>
 mutate(value = df_values$value[key])

# end timer
end_time <- Sys.time()

end_time - start_time

My real code, which has the same shape, takes a very long time to run, and I can't figure out why. The example above needs only 0.3003931 seconds. For my real code I subsetted the tibble with head(n) and got the following times:

n	time in secs
50	1.993536
100	3.731
200	6.550074
300	9.500864
500	15.68515
1,000	32.19306
...	seems to be linear
20,000	maybe 10 minutes

Does anyone have an idea what could be wrong with my code? I guess it's the indexing part, df_values$value[key]? But my original df_values is also a data.frame with 500 observations.

Answer 1

A possible solution in base R. As the timings below show, the base R lookup takes only about 1% of the time of your dplyr approach (0.0022 s versus 0.19 s). Even with rowwise removed, the base R approach is much faster.

library(tidyverse)
library(stringi)

# start timer
start_time <- Sys.time()

df_keys |>
  rowwise() |>
  mutate(value = df_values$value[key])
#> # A tibble: 25,000 × 2
#> # Rowwise: 
#>      key value                         
#>    <int> <chr>                         
#>  1   287 BeFLZsuRxlKJAJLgOnH1SO2f6kjpPH
#>  2   292 yG1JoxKRzSDnBlk4fJKDcKwzAUGwOy
#>  3   334 38pJ1h3RaTTSDgcf7gyCuW2NqFyncZ
#>  4   120 LqqCmTiMQV50hV0c0yYzk94AtpV7I6
#>  5   233 62BsX6NAEQqYx5wjm5ienCYgDmvJDb
#>  6   413 OB2MqTt1SOTb3irKlLEBtr4MfvuWW5
#>  7   123 4IKKUTli7c1l8GwU8TTpWHLHirGCy8
#>  8   400 aDnB9PwIKQkdfAW5kwzM215vU9aCNk
#>  9   214 aOsJkVENbncaHESiU2rwmfXqY5yVsK
#> 10   332 v4DfYVOr9kedtIwnWFlefDfFhHJ25R
#> # … with 24,990 more rows

# end timer
end_time <- Sys.time()

end_time - start_time

#> Time difference of 0.1876147 secs

start_time <- Sys.time()
df_keys$value <- df_values$value[df_keys$key]
end_time <- Sys.time()

end_time - start_time

#> Time difference of 0.002212286 secs
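
For reference, rowwise() makes mutate() evaluate its expression once per row, so the lookup runs 25,000 times instead of once. The sketch below is not part of the original answer. It rebuilds the question's example data, drops rowwise() so that the lookup is vectorised, and checks that the result matches the base R assignment. The names res_dplyr and res_base are introduced here purely for illustration.

```r
library(dplyr)
library(stringi)

# Same shapes as in the question: 500 lookup values, 25,000 keys
df_values <- data.frame(value = stri_rand_strings(n = 500, length = 30))
df_keys <- tibble(key = sample(x = 1:500, size = 25000, replace = TRUE))

# dplyr without rowwise(): mutate() evaluates the lookup once
# on the whole key column instead of once per row
res_dplyr <- df_keys |>
  mutate(value = df_values$value[key])

# Base R lookup from the answer, applied to a copy (res_base is a hypothetical name)
res_base <- df_keys
res_base$value <- df_values$value[res_base$key]

# Both approaches should return the same looked-up values
identical(res_dplyr$value, res_base$value)
```

If the real code does not need per-row evaluation, removing rowwise() (or calling ungroup() before the mutate) may be enough on its own; whether it closes the whole gap depends on what else the real pipeline does.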

Answer 1: score 3, accepted, 2022-06-15 18:35:35
