
Find the most common combinations within each group in R

I have the following dataset, showing the INGREDIENTS contained in each PRODUCT:

data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"), 
            "INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))

I want to know the most common combinations of INGREDIENTS in each PRODUCT: which ingredient is associated with which other ingredient? I applied code I found in another thread:

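library(dplyr)

# Self-join on PRODUCT to pair up the ingredients of each product,
# then count the distinct products in which each ordered pair occurs.
combinaisons_par_PRODUCT <- data %>% 
  full_join(data, by = "PRODUCT") %>% 
  group_by(INGREDIENT.x, INGREDIENT.y) %>% 
  summarise(n = n_distinct(PRODUCT)) %>% 
  filter(INGREDIENT.x != INGREDIENT.y) %>%
  mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))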

It works, but there is one remaining flaw: I would like the order to be ignored. For instance, this code gives me 1 association of HAHA and SASA, and also 1 association of SASA and HAHA. For me, these are the same thing. So I would like the code to ignore the order of the INGREDIENTS and give me one unique association for the pair HAHA & SASA.

I tried sorting the INGREDIENTS before applying the code, but that didn't work either. Could someone help me, please? How can I get these combinations regardless of order?

Thank you very much!

Does this do what you want? I keep only the pairs whose two ingredients are in alphabetical order, so each unordered combination is counted once and an ingredient is never paired with itself.

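library(dplyr)

data %>% 
  full_join(data, by = "PRODUCT") %>%       # all ingredient pairs within each product
  filter(INGREDIENT.x < INGREDIENT.y) %>%   # keep one alphabetical ordering per pair
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))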
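Since the question asks for the most common combinations, it may help to rank the pairs as well. The following is only a sketch of the same alphabetical-order idea in base R, not the answerer's code: the names d, pairs and counts are made up for illustration, and unique() is added on the assumption that a repeated (PRODUCT, INGREDIENT) row should not be counted twice.

```r
data <- data.frame(
  PRODUCT = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc",
              "Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll",
              "Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
  INGREDIENT = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem",
                 "zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu",
                 "sasa","zuzu","ozam","zaza","hayda"),
  stringsAsFactors = FALSE  # keep strings as character so `<` compares alphabetically
)

# Assumption: duplicated (PRODUCT, INGREDIENT) rows should count once.
d <- unique(data)

# Self-join on PRODUCT: every ordered pair of ingredients within a product.
pairs <- merge(d, d, by = "PRODUCT")

# Keep one orientation per pair (alphabetical); this also drops self-pairs.
pairs <- pairs[pairs$INGREDIENT.x < pairs$INGREDIENT.y, ]

# Count how many products contain each unordered pair.
counts <- as.data.frame(
  table(combo = paste(pairs$INGREDIENT.x, pairs$INGREDIENT.y, sep = ", ")),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Rank from most to least common.
counts <- counts[order(-counts$n), ]
print(head(counts, 5))
```

With the data from the question, the top row should be "zaza, zuzu" with n = 4, matching the weight shown for that pair in the igraph output further down.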

An igraph option using graph_from_adjacency_matrix

library(igraph)

get.data.frame(
    graph_from_adjacency_matrix(
        crossprod(table(data)),
        mode = "undirected",
        weighted = TRUE
    )
)

gives

     from     to weight
1    haha   haha      2
2    haha   sasa      1
3    haha   zaza      2
4    haha zemzem      1
5    haha   zizi      1
6    haha   zuzu      2
7   hayda  hayda      1
8   hayda   ozam      1
9   hayda   zaza      1
10   ozam   ozam      1
11   ozam   zaza      1
12   sasa   sasa      2
13   sasa   zaza      1
14   sasa zemzem      1
15   sasa   zuzu      2
16   zaza   zaza      5
17   zaza zemzem      2
18   zaza   zeze      1
19   zaza   zizi      1
20   zaza   zozo      2
21   zaza   zuzu      4
22 zemzem zemzem      2
23 zemzem   zozo      1
24 zemzem   zuzu      2
25   zeze   zeze      1
26   zeze   zozo      1
27   zeze   zuzu      1
28   zizi   zizi      1
29   zizi   zuzu      1
30   zozo   zozo      2
31   zozo   zuzu      2
32   zuzu   zuzu      5

We could use base R

m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)

EDIT: Comments from @ThomasIsCoding
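Both matrix-based answers keep the diagonal of the co-occurrence matrix, so rows such as haha/haha (an ingredient paired with itself) show up in the result. If only genuine pairs are wanted, one possible variation is to take the strict lower triangle and sort by count. This is a sketch under that assumption; the names keep and pairs, and the column names a, b and n, are illustrative and not from the answers.

```r
data <- data.frame(
  PRODUCT = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc",
              "Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll",
              "Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
  INGREDIENT = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem",
                 "zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu",
                 "sasa","zuzu","ozam","zaza","hayda"),
  stringsAsFactors = FALSE
)

# Ingredient-by-ingredient co-occurrence matrix: entry [i, j] is the number
# of products that contain both ingredient i and ingredient j.
m1 <- crossprod(table(data))

# Strict lower triangle: each unordered pair once, diagonal excluded.
keep <- lower.tri(m1, diag = FALSE)

pairs <- data.frame(
  a = rownames(m1)[row(m1)[keep]],
  b = colnames(m1)[col(m1)[keep]],
  n = m1[keep],
  stringsAsFactors = FALSE
)

# Drop pairs that never occur together, then rank by frequency.
pairs <- pairs[pairs$n > 0, ]
pairs <- pairs[order(-pairs$n), ]
print(head(pairs, 5))
```

Because the lower triangle is used, column a holds the alphabetically later ingredient of each pair (for example a = "zuzu", b = "zaza" for the most frequent pair).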
