Indexing data frame takes too long
I've got some code that looks like this:
library(stringi)
library(dplyr)  # provides tibble(), rowwise() and mutate()

df_values <- data.frame(value = stri_rand_strings(n = 500,
                                                  length = 30))
df_keys <- tibble(key = sample(x = 1:500,
                               size = 25000,
                               replace = TRUE))

# start timer
start_time <- Sys.time()
df_keys |>
  rowwise() |>
  mutate(value = df_values$value[key])
# end timer
end_time <- Sys.time()
end_time - start_time
My real code takes a very long time to run, but I can't figure out why. The example above takes only 0.3003931 seconds. For my real code, I subsetted the tibble with head(n) and got the following times:
n | time in secs |
---|---|
50 | 1.993536 |
100 | 3.731 |
200 | 6.550074 |
300 | 9.500864 |
500 | 15.68515 |
1,000 | 32.19306 |
... | seems to be linear |
20,000 | maybe 10 minutes |
Does anyone have an idea what could be wrong with my code? I guess it's the indexing part, df_values$value[key]. But my original df_values is also a data.frame with 500 observations.
A possible solution, in base R. As the timings below show, it takes only about 1% of the execution time of your dplyr approach. Even if you remove rowwise, the base R approach is still much faster.
library(tidyverse)
library(stringi)
# start timer
start_time <- Sys.time()
df_keys |>
  rowwise() |>
  mutate(value = df_values$value[key])
#> # A tibble: 25,000 × 2
#> # Rowwise:
#> key value
#> <int> <chr>
#> 1 287 BeFLZsuRxlKJAJLgOnH1SO2f6kjpPH
#> 2 292 yG1JoxKRzSDnBlk4fJKDcKwzAUGwOy
#> 3 334 38pJ1h3RaTTSDgcf7gyCuW2NqFyncZ
#> 4 120 LqqCmTiMQV50hV0c0yYzk94AtpV7I6
#> 5 233 62BsX6NAEQqYx5wjm5ienCYgDmvJDb
#> 6 413 OB2MqTt1SOTb3irKlLEBtr4MfvuWW5
#> 7 123 4IKKUTli7c1l8GwU8TTpWHLHirGCy8
#> 8 400 aDnB9PwIKQkdfAW5kwzM215vU9aCNk
#> 9 214 aOsJkVENbncaHESiU2rwmfXqY5yVsK
#> 10 332 v4DfYVOr9kedtIwnWFlefDfFhHJ25R
#> # … with 24,990 more rows
# end timer
end_time <- Sys.time()
end_time - start_time
#> Time difference of 0.1876147 secs

# base R: index with the whole key column in one vectorized call
start_time <- Sys.time()
df_keys$value <- df_values$value[df_keys$key]
end_time <- Sys.time()
end_time - start_time
#> Time difference of 0.002212286 secs
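
To make the comparison above easier to reproduce, here is a minimal sketch that times three variants on data shaped like the question's: rowwise() plus mutate(), plain mutate() without rowwise(), and the base R assignment. The fixed seed, the base R string generator (used instead of stringi to keep the sketch dependency-light) and the names res_rowwise, res_vector, res_base, t_rowwise, t_vector and t_base are assumptions introduced for illustration; they are not from the original post. Actual timings will vary by machine, so no expected output is shown.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only for reproducibility

# Data with the same shape as in the question (500 values, 25,000 keys).
# The random strings are built with base R here instead of stringi.
chars <- c(letters, LETTERS, 0:9)
df_values <- data.frame(
  value = vapply(
    1:500,
    function(i) paste(sample(chars, 30, replace = TRUE), collapse = ""),
    character(1)
  ),
  stringsAsFactors = FALSE
)
df_keys <- tibble(key = sample(x = 1:500, size = 25000, replace = TRUE))

# 1. rowwise(): the mutate() expression is evaluated once per row
t_rowwise <- system.time(
  res_rowwise <- df_keys |>
    rowwise() |>
    mutate(value = df_values$value[key]) |>
    ungroup()
)

# 2. plain mutate(): `key` is the whole column, so this is one vectorized index
t_vector <- system.time(
  res_vector <- df_keys |>
    mutate(value = df_values$value[key])
)

# 3. base R: the assignment used in the answer above
t_base <- system.time({
  res_base <- df_keys
  res_base$value <- df_values$value[res_base$key]
})

# All three variants should produce the same value column
stopifnot(
  identical(res_rowwise$value, res_vector$value),
  identical(res_vector$value, res_base$value)
)

# Elapsed seconds for each variant
rbind(rowwise = t_rowwise, vectorized = t_vector, base = t_base)[, "elapsed", drop = FALSE]
```

Because indexing with `[` is already vectorized, rowwise() is not needed for this lookup; dropping it keeps the dplyr pipeline while avoiding the per-row evaluation.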