Group dataframe by multiple variables in R

Question

I have a dataframe containing 5713 rows and 7 columns. Many of the rows are duplicates. I need to create groups of 5 by "gender" and "size" while ensuring the "item" column does not contain duplicates and the "type" column only contains a maximum of 1 "Fleece". I have tried sample, split, group_by, sample_n, but can't seem to figure out how to include all the variables.

Here is a sample of the dataframe:

                  SKU          UPC   type rating size gender  item
1  M3MEN-SU15-BLU-XXL 628012010215    Tee      5  XXL      M M3MEN
2  M3MEN-SU15-GRY-XXL 628012010314    Tee      5  XXL      M M3MEN
3   M3MEN-SU15-GRY-XL 628012010316   Tank      5   XL      M M3MEN
4          MAMA-CHA-S *MAMA-CHA-S*   Tank      5    S      M  MAMA
5          MAMA-CHA-S *MAMA-CHA-S*    Tee      5    S      M  MAMA
6          MBAN-CHA-M *MBAN-CHA-M* Fleece      3    M      W  MBAN
7          WAZA-CHA-L *WAZA-CHA-L* Fleece      3    L      M  WAZA
8          MBAN-CHA-M *MBAN-CHA-M* Fleece      3    M      W  MBAN
9          MBAN-CHA-M *MBAN-CHA-M* Fleece      3    M      M  MBAN
10         MCON-CHA-M *MCON-CHA-M* Fleece      3    M      M  MCON

Ideally I would like to create a new column that creates a unique ID for each group of 5.

For example:

                  SKU          UPC   type rating size gender  item  id
1    M3MEN-SU15-BLU-S 628012010215    Tee      5    S      M M3MEN   1
2          MAMA-CHA-S *MAMA-CHA-S*   Tank      5    S      M  MAMA   1
3          MBAN-CHA-S *MBAN-CHA-S*   Tank      3    S      M  MBAN   1
4          MAZA-CHA-S *MAZA-CHA-S*    Tee      3    S      M  MAZA   1
5          MCON-CHA-S *MCON-CHA-S* Fleece      3    S      M  MCON   1  
6    W3MEN-SU15-BLU-M 428012010215    Tee      2    M      W W3WOM   2
7          WAMA-CHA-M *WAMA-CHA-M*   Tank      4    M      W  MAMA   2
8          WBAN-CHA-M *WBAN-CHA-M*   Tank      5    M      W  MBAN   2
9          WAZA-CHA-M *WAZA-CHA-M*    Tee      1    M      W  MAZA   2
10         WCON-CHA-M *WCON-CHA-M* Fleece      3    M      W  MCON   2

I have been struggling with this for a while now. Any help would be greatly appreciated!

Answer 1

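Avoiding duplicates of item within a group is straightforward with the distinct function. In current versions of dplyr, pass .keep_all = TRUE so that the other columns (including type, which the next step needs) are retained:

library(dplyr)
df %>%
  group_by(gender, size) %>%
  distinct(item, .keep_all = TRUE)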

Ensuring there is no more than one "Fleece" is a bit trickier, but doable with filter and cumsum. This removes every Fleece row except the first one within each group and leaves the non-Fleece rows untouched:

  filter(!(type == "Fleece" & cumsum(type == "Fleece") > 1))
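To see why the condition tests both the type and the running count, here is a small base R sketch on a made-up type vector (the values are hypothetical, not taken from the question's data). It compares the filter above with a shorter variant that only checks the running count:

```r
# Hypothetical type column for a single gender/size group
type <- c("Tee", "Fleece", "Tank", "Fleece", "Tee", "Fleece")

# Running count of Fleece rows seen so far: 0 1 1 2 2 3
fleece_count <- cumsum(type == "Fleece")

# Drop a row only when it is a Fleece AND not the first Fleece
keep <- !(type == "Fleece" & fleece_count > 1)
kept_types <- type[keep]

# Checking the running count alone also discards the non-Fleece
# rows that happen to come after the second Fleece
count_only_types <- type[fleece_count <= 1]

print(kept_types)
print(count_only_types)
```

The first version keeps the Tee that follows the second Fleece, while the count-only version loses it, so the longer condition is the safer one when row order is arbitrary.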

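Then you can use sample_n as you attempted originally. It draws 5 rows from each gender/size group, and it raises an error if any group has fewer than 5 rows left: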

  sample_n(5)

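Putting it together, the code is:

df <- df %>%
  group_by(gender, size) %>%
  # keep one row per item within each gender/size group, retaining all columns
  distinct(item, .keep_all = TRUE) %>%
  # drop every Fleece after the first one in each group
  filter(!(type == "Fleece" & cumsum(type == "Fleece") > 1)) %>%
  sample_n(5)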
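The question also asks for an id column. The following is only a sketch of one way to add it, run on made-up data (the toy data frame and its ITEM names are hypothetical). It assumes dplyr 1.0.0 or later for cur_group_id(), and it adds a guard that skips groups with fewer than 5 rows so that sample_n does not fail:

```r
library(dplyr)

set.seed(42)  # only to make the random sample reproducible

# Made-up data: 2 genders x 2 sizes, 8 items each, every row duplicated once
toy <- expand.grid(item_no = 1:8, size = c("S", "M"), gender = c("M", "W"),
                   stringsAsFactors = FALSE)
toy$item <- paste0("ITEM", toy$item_no)
toy$type <- ifelse(toy$item_no <= 3, "Fleece",
                   ifelse(toy$item_no <= 6, "Tee", "Tank"))
toy <- rbind(toy, toy)

grouped <- toy %>%
  group_by(gender, size) %>%
  distinct(item, .keep_all = TRUE) %>%                      # one row per item
  filter(!(type == "Fleece" & cumsum(type == "Fleece") > 1)) %>%  # at most one Fleece
  filter(n() >= 5) %>%                                      # skip groups that are too small
  sample_n(5) %>%
  mutate(id = cur_group_id()) %>%                           # one id per gender/size group
  ungroup()

print(table(grouped$id))
```

Note that this yields a single group of five per gender/size combination. If the remaining rows of each combination should also be split into further groups of five, an extra step would be needed.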
