
Find the most common combinations within each group in R

I have the following dataset, which shows the INGREDIENTS contained in each PRODUCT:

data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"), 
            "INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))

I want to know the most common combinations of INGREDIENTS within each PRODUCT, that is, which ingredients tend to appear together with which others. I applied code I found in another thread:

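library(dplyr)

# Self-join on PRODUCT to get every ordered pair of ingredients per product,
# then count the distinct products in which each pair occurs
combinaisons_par_PRODUCT <- data %>% 
  full_join(data, by="PRODUCT") %>% 
  group_by(INGREDIENT.x, INGREDIENT.y) %>% 
  summarise(n = length(unique(PRODUCT))) %>% 
  filter(INGREDIENT.x!=INGREDIENT.y) %>%
  mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep=", "))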

It works, but there is one remaining flaw: I would like the order to be ignored. For instance, this code gives me one association of HAHA and SASA, and also one association of SASA and HAHA. For me, these are the same thing, so I would like the code to ignore the order of the INGREDIENTS and report the HAHA & SASA pair only once.

I tried sorting the INGREDIENTS before applying the code, but that didn't work either. Could someone help me, please? How can I get these combinations regardless of order?

Thank you very much!

Does this do what you want? I keep only the rows where the two ingredients are in alphabetical order, which avoids counting each pair twice (and also drops an ingredient paired with itself).

library(dplyr)

data %>% 
  full_join(data, by="PRODUCT") %>%
  # keep one orientation of each pair; also removes x == y
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))
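To see why the alphabetical filter is enough, here is a rough base R sketch of the same idea, with the result sorted so the most common pairs come first. It is only an illustration, not the code from the answer above: `pairs`, `combo` and `combo_counts` are names I made up, and `stringsAsFactors = FALSE` is set so that the `<` comparison also works on R versions before 4.0.

```r
# Sample data from the question
data <- data.frame(
  PRODUCT = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc",
              "Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll",
              "Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
  INGREDIENT = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem",
                 "zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu",
                 "sasa","zuzu","ozam","zaza","hayda"),
  stringsAsFactors = FALSE
)

# Self-join on PRODUCT: every ordered pair of ingredients within a product,
# so each unordered pair shows up twice (a, b) and (b, a)
pairs <- merge(data, data, by = "PRODUCT")

# Keep a single orientation per pair; this also drops an ingredient paired with itself
pairs <- pairs[pairs$INGREDIENT.x < pairs$INGREDIENT.y, ]

# Count the products per pair, most common first
combo <- paste(pairs$INGREDIENT.x, pairs$INGREDIENT.y, sep = ", ")
combo_counts <- sort(table(combo), decreasing = TRUE)
head(combo_counts)
```

With the sample data, "zaza, zuzu" should come out on top, since those two ingredients appear together in four products.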

An igraph option using graph_from_adjacency_matrix

library(igraph)

get.data.frame(
    graph_from_adjacency_matrix(
        crossprod(table(data)),
        mode = "undirected",
        weighted = TRUE
    )
)

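gives the following. Rows where from and to are the same ingredient come from the diagonal of the matrix, i.e. the number of products that contain that ingredient: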

     from     to weight
1    haha   haha      2
2    haha   sasa      1
3    haha   zaza      2
4    haha zemzem      1
5    haha   zizi      1
6    haha   zuzu      2
7   hayda  hayda      1
8   hayda   ozam      1
9   hayda   zaza      1
10   ozam   ozam      1
11   ozam   zaza      1
12   sasa   sasa      2
13   sasa   zaza      1
14   sasa zemzem      1
15   sasa   zuzu      2
16   zaza   zaza      5
17   zaza zemzem      2
18   zaza   zeze      1
19   zaza   zizi      1
20   zaza   zozo      2
21   zaza   zuzu      4
22 zemzem zemzem      2
23 zemzem   zozo      1
24 zemzem   zuzu      2
25   zeze   zeze      1
26   zeze   zozo      1
27   zeze   zuzu      1
28   zizi   zizi      1
29   zizi   zuzu      1
30   zozo   zozo      2
31   zozo   zuzu      2
32   zuzu   zuzu      5

We could also use base R, keeping only the lower triangle of the co-occurrence matrix so that each pair appears once:

m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)
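In case the `crossprod(table(data))` step looks like magic, here is a step-by-step sketch of what it computes. This is a variation on the code above rather than the same code: it drops the diagonal as well (so only genuine pairs remain) and sorts by frequency. The names `inc`, `co`, `pairs` and the column names are my own choices for illustration.

```r
# Sample data from the question
data <- data.frame(
  PRODUCT = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc",
              "Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll",
              "Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
  INGREDIENT = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem",
                 "zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu",
                 "sasa","zuzu","ozam","zaza","hayda"),
  stringsAsFactors = FALSE
)

# Product x ingredient incidence matrix: 1 if the product contains the ingredient
inc <- table(data$PRODUCT, data$INGREDIENT)

# t(inc) %*% inc: entry [i, j] is the number of products containing both i and j
co <- crossprod(inc)

# Zero out the diagonal and the upper triangle so each unordered pair is kept once
co[upper.tri(co, diag = TRUE)] <- 0

# Long format, dropping pairs that never occur together
pairs <- subset(as.data.frame.table(co, stringsAsFactors = FALSE), Freq > 0)
names(pairs) <- c("ingredient_1", "ingredient_2", "n_products")

# Most common pairs first
pairs <- pairs[order(-pairs$n_products), ]
head(pairs)
```

This relies on each PRODUCT/INGREDIENT row being unique, as in the sample data; with duplicated rows the incidence matrix would contain values above 1 and the counts would be inflated.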

EDIT: Comments from @ThomasIsCoding
