[英]dplyr sample_n by one variable through another one
我有一个带有“分组”可变season
和另一个可变year
的数据框,每个月重复一次。
df <- data.frame(month = as.character(sapply(month.name,function(x)rep(x,4))),
season = c(rep("winter",8),rep("spring",12),rep("summer",12),rep("autumn",12),rep("winter",4)),
year = rep(2021:2024,12))
我想使用dplyr::sample_n
或类似的方法在每个季节的数据框中选择2个月,并在所有年份中保留相同的月份,例如:
month season year
1 January winter 2021
2 January winter 2022
3 January winter 2023
4 January winter 2024
5 February winter 2021
6 February winter 2022
7 February winter 2023
8 February winter 2024
9 March spring 2021
10 March spring 2022
11 March spring 2023
12 March spring 2024
13 May spring 2021
14 May spring 2022
15 May spring 2023
16 May spring 2024
17 June summer 2021
18 June summer 2022
19 June summer 2023
20 June summer 2024
21 July summer 2021
22 July summer 2022
23 July summer 2023
24 July summer 2024
25 October autumn 2021
26 October autumn 2022
27 October autumn 2023
28 October autumn 2024
29 November autumn 2021
30 November autumn 2022
31 November autumn 2023
32 November autumn 2024
我无法使df %>% group_by(season,year) %>% sample_n(2)
因为它每年选择不同的月份。
谢谢!
我们可以从month
随机sample
2个值,然后按组filter
它们。
library(dplyr)
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),2))
# month season year
# <chr> <chr> <int>
# 1 January winter 2021
# 2 January winter 2022
# 3 January winter 2023
# 4 January winter 2024
# 5 February winter 2021
# 6 February winter 2022
# 7 February winter 2023
# 8 February winter 2024
# 9 March spring 2021
#10 March spring 2022
# … with 22 more rows
如果对于某些组,少于2个unique
值,我们可以在要sample
的组中的2个和唯一值之间选择min
。
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),min(2, n_distinct(month))))
对基数R使用相同的逻辑,我们可以使用ave
df[as.logical(with(df, ave(month, season,
FUN = function(x) x %in% sample(unique(x),2)))), ]
使用slice
的选项
library(dplyr)
df %>%
group_by(season) %>%
slice(which(!is.na(match(month, sample(unique(month), 2)))))
# A tibble: 32 x 3
# Groups: season [4]
# month season year
# <fct> <fct> <int>
# 1 October autumn 2021
# 2 October autumn 2022
# 3 October autumn 2023
# 4 October autumn 2024
# 5 November autumn 2021
# 6 November autumn 2022
# 7 November autumn 2023
# 8 November autumn 2024
# 9 April spring 2021
#10 April spring 2022
# … with 22 more rows
或使用base R
by(df, df$season, FUN = function(x) subset(x, month %in% sample(unique(month), 2 )))
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.