Get X rows from a data frame based on the minimum value in a specified column, choosing randomly among ties

Question

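top_n from dplyr is almost what I want. However, when there are ties it returns all of the tied rows, whereas I want it to pick randomly among the ties so that the result meets my row cutoff (top_number, which is 4 here).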

top_number <- 4
x <- c(1, 2, 3, 3, 6, 6, 6)
y <- c(3, 2, 5, 1, 2, 3, 3)
xy <- data.frame(x, y)
xy
# top_n() keeps every tied row, so this returns 5 rows instead of 4
xy_1 <- dplyr::top_n(xy, top_number, wt = x)
xy_1

Note that the three 6s in x should always be selected, and the random choice should then be made between the two 3s.
A tidyverse solution would be nice.

Answer 1

Another option could be:

xy %>%
 top_n(top_number, wt = x) %>%
 sample_n(top_number)

To address the updated question, add purrr:

library(dplyr)
library(purrr)

xy %>%
 top_n(top_number, wt = x) %>%
 add_count(x, name = "n_all") %>%
 add_count(x, y) %>%
 group_split(n) %>%
 map_dfr(~ mutate(., cond = if_else(n != n_all, 1, top_number)) %>%
          sample_n(cond) %>%
          select(x, y))

Answer 2

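After getting the top_n rows, we can slice them with a random sample based on row_number():

library(dplyr)
top_n(xy, top_number, wt = x) %>% 
   arrange(desc(x)) %>% 
   # keep the first top_number - 1 rows, then draw one row from the rest
   slice(c(seq_len(top_number - 1), sample(top_number:n(), 1)))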
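The slice() call above relies on exactly one slot being left for the tied rows, which holds for the example data. As a sketch of a variant that does not depend on that, one could sort on x and break ties with a random helper column. Here tie_breaker is a hypothetical name introduced for illustration, not part of the original answer, and set.seed() is only there to make the sketch reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
top_number <- 4
xy <- data.frame(x = c(1, 2, 3, 3, 6, 6, 6),
                 y = c(3, 2, 5, 1, 2, 3, 3))

xy_rand <- xy %>%
  mutate(tie_breaker = runif(n())) %>%  # hypothetical helper column
  arrange(desc(x), tie_breaker) %>%     # largest x first, ties in random order
  head(top_number) %>%                  # keep exactly top_number rows
  select(-tie_breaker)

xy_rand
```

Because the random column only matters within groups of equal x, the three 6s are always kept and the fourth row is one of the two 3s.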

Answer 3

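If you are happy with a deterministic solution, in which the first rows that have the minimum value are always the ones selected, you can do this: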

# non_random
xy %>% slice( order(desc(x)) %>% head(top_number) )

This approach ends up being much faster than using randomness to choose among the group of rows that share the minimum value.
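In dplyr 1.0.0 and later, top_n() is superseded by slice_max() and slice_min(), so a deterministic cut at exactly top_number rows can also be sketched with slice_max() and with_ties = FALSE. This is an alternative spelling under that version assumption, not the code that was timed in this answer.

```r
library(dplyr)

top_number <- 4
xy <- data.frame(x = c(1, 2, 3, 3, 6, 6, 6),
                 y = c(3, 2, 5, 1, 2, 3, 3))

# with_ties = FALSE returns exactly top_number rows instead of
# keeping every row tied at the cutoff
xy_det <- slice_max(xy, order_by = x, n = top_number, with_ties = FALSE)

xy_det
```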

However, if you need randomness but do not need the result to be sorted, you can do this:

# random_unordered
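xy %>% 
    top_n(top_number, x) %>% 
    # keep every row above the tied minimum, then fill the remaining
    # slots with a random draw from the tied rows
    slice(c( seq_len(n())[x != min(x)], 
             sample(seq_len(n())[x == min(x)], top_number - sum(x != min(x))) ))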

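If you need both randomness and ordered output, you can use the solution provided by @akrun.

I tested these three approaches, calling the first one non_random, the second one random_unordered, and akrun's random_ordered. The tests were run on data frames with different numbers of rows, taking the median run time of each method over 100 executions. Here are the results.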
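For anyone who wants to rerun a comparison like this, a minimal timing harness might look like the following. It is only a sketch: median_time is a hypothetical helper, the 1000-row test data is an arbitrary illustration (the sizes used in the original test are not given), and only two of the three methods are wrapped to keep it short.

```r
library(dplyr)

top_number <- 4

# first method from this answer
non_random <- function(df) {
  df %>% slice(order(desc(x)) %>% head(top_number))
}

# akrun's method, as quoted in Answer 2
random_ordered <- function(df) {
  top_n(df, top_number, wt = x) %>%
    arrange(desc(x)) %>%
    slice(c(seq_len(top_number - 1), sample(top_number:n(), 1)))
}

# hypothetical helper: median elapsed seconds over `reps` runs
median_time <- function(f, df, reps = 100) {
  times <- vapply(seq_len(reps),
                  function(i) system.time(f(df))[["elapsed"]],
                  numeric(1))
  median(times)
}

set.seed(1)
# illustrative data only; many ties are expected at the top value
df <- data.frame(x = sample(1:50, 1000, replace = TRUE),
                 y = sample(1:50, 1000, replace = TRUE))

timings <- c(non_random     = median_time(non_random, df),
             random_ordered = median_time(random_ordered, df))
timings
```

system.time() has coarse resolution, so on small data frames the medians can come out as zero; a dedicated benchmarking package would give finer measurements.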

Solution 1
3 2020-01-07 19:41:16

Solution 2
2 Accepted 2020-01-07 19:38:37

Solution 3
1 2020-01-07 23:04:32
