I have a dataframe containing 5713 rows and 7 columns. Many of the rows are duplicates. I need to create groups of 5 by "gender" and "size" while ensuring the "item" column does not contain duplicates and the "type" column only contains a maximum of 1 "Fleece". I have tried sample, split, group_by, sample_n, but can't seem to figure out how to include all the variables.
Here is a sample of the dataframe:
    SKU                 UPC           type    rating  size  gender  item
1   M3MEN-SU15-BLU-XXL  628012010215  Tee     5       XXL   M       M3MEN
2   M3MEN-SU15-GRY-XXL  628012010314  Tee     5       XXL   M       M3MEN
3   M3MEN-SU15-GRY-XL   628012010316  Tank    5       XL    M       M3MEN
4   MAMA-CHA-S          *MAMA-CHA-S*  Tank    5       S     M       MAMA
5   MAMA-CHA-S          *MAMA-CHA-S*  Tee     5       S     M       MAMA
6   MBAN-CHA-M          *MBAN-CHA-M*  Fleece  3       M     W       MBAN
7   WAZA-CHA-L          *WAZA-CHA-L*  Fleece  3       L     M       WAZA
8   MBAN-CHA-M          *MBAN-CHA-M*  Fleece  3       M     W       MBAN
9   MBAN-CHA-M          *MBAN-CHA-M*  Fleece  3       M     M       MBAN
10  MCON-CHA-M          *MCON-CHA-M*  Fleece  3       M     M       MCON
Ideally I would like to create a new column that creates a unique ID for each group of 5.
For example:
    SKU               UPC           type    rating  size  gender  item   id
1   M3MEN-SU15-BLU-S  628012010215  Tee     5       S     M       M3MEN  1
2   MAMA-CHA-S        *MAMA-CHA-S*  Tank    5       S     M       MAMA   1
3   MBAN-CHA-S        *MBAN-CHA-S*  Tank    3       S     M       MBAN   1
4   MAZA-CHA-S        *MAZA-CHA-S*  Tee     3       S     M       MAZA   1
5   MCON-CHA-S        *MCON-CHA-S*  Fleece  3       S     M       MCON   1
6   W3MEN-SU15-BLU-M  428012010215  Tee     2       M     W       W3WOM  2
7   WAMA-CHA-M        *WAMA-CHA-M*  Tank    4       M     W       MAMA   2
8   WBAN-CHA-M        *WBAN-CHA-M*  Tank    5       M     W       MBAN   2
9   WAZA-CHA-M        *WAZA-CHA-M*  Tee     1       M     W       MAZA   2
10  WCON-CHA-M        *WCON-CHA-M*  Fleece  3       M     W       MCON   2
I have been struggling with this for a while now. Any help would be greatly appreciated!
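Avoiding duplicates of item within a group is straightforward with the distinct function. In current dplyr versions, pass .keep_all = TRUE so that the other columns (including type, which the next step needs) are retained:
library(dplyr)
df %>%
  group_by(gender, size) %>%
  distinct(item, .keep_all = TRUE)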
Ensuring there is not more than one "Fleece" is a bit trickier, but doable with filter and cumsum. This removes all but the first Fleece (within each group).
filter(!(type == "Fleece" & cumsum(type == "Fleece") > 1))
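To see why this condition drops only the extra Fleece rows, here is a small base R sketch of the same logical mask. The type vector is made up for illustration and stands in for the type column of a single gender/size group.

```r
# Made-up values standing in for the type column of one gender/size group.
type <- c("Tee", "Fleece", "Tank", "Fleece", "Tee", "Fleece")

is_fleece <- type == "Fleece"

# Running count of Fleece rows seen so far: 0 1 1 2 2 3
fleece_count <- cumsum(is_fleece)

# Drop a row only if it is a Fleece AND at least one Fleece came before it.
# Non-Fleece rows are always kept.
keep <- !(is_fleece & fleece_count > 1)

type[keep]
```

Note that this keeps the first Fleece in row order, not a randomly chosen one. If a random Fleece is wanted, the rows would need to be shuffled before filtering.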
Then you can use sample_n, as you attempted originally:
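In total, your code is:
df <- df %>%
  group_by(gender, size) %>%
  distinct(item, .keep_all = TRUE) %>%
  # drop every Fleece after the first, but keep all non-Fleece rows
  filter(!(type == "Fleece" & cumsum(type == "Fleece") > 1)) %>%
  # note: sample_n(5) errors if a group has fewer than 5 rows left
  sample_n(5)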
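As a rough end-to-end sketch, the pipeline can be tried on a small made-up data frame. This version assumes dplyr 1.0.0 or later, where slice_sample and cur_group_id are available. It also adds two things that are not in the answer above: a filter that skips gender/size combinations with fewer than 5 eligible rows, and an id column taken from the group index. The toy data and the name groups are illustrative only.

```r
library(dplyr)

set.seed(1)

# Made-up data: one gender/size combination with 8 rows (a duplicated item
# and two Fleece rows) and one combination with only 2 rows.
df <- data.frame(
  item   = c("A", "A", "B", "C", "D", "E", "F", "G", "H", "I"),
  type   = c("Tee", "Tee", "Fleece", "Tank", "Fleece",
             "Tee", "Tank", "Tee", "Tee", "Tank"),
  size   = c(rep("S", 8), rep("M", 2)),
  gender = c(rep("M", 8), rep("W", 2))
)

groups <- df %>%
  group_by(gender, size) %>%
  # one row per item within each gender/size combination
  distinct(item, .keep_all = TRUE) %>%
  # keep at most one Fleece per combination
  filter(!(type == "Fleece" & cumsum(type == "Fleece") > 1)) %>%
  # assumption: combinations with fewer than 5 eligible rows are skipped
  filter(n() >= 5) %>%
  slice_sample(n = 5) %>%
  # id is the index of the gender/size combination among those that remain
  mutate(id = cur_group_id()) %>%
  ungroup()

groups
```

This yields at most one group of five per gender/size combination. Splitting a large combination into several groups of five, each with its own id, would need additional logic beyond this sketch.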