Find the most common combinations within each group in R
I have the following dataset, showing the INGREDIENTS contained in each PRODUCT:
data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
"INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))
I want to know the most common combinations of INGREDIENTS in each PRODUCT: which ingredient is associated with which other ingredient? I applied the code I found in this thread:
library(dplyr)
combinaisons_par_PRODUCT = data %>%
full_join(data, by="PRODUCT") %>%
group_by(INGREDIENT.x, INGREDIENT.y) %>%
summarise(n = length(unique(PRODUCT))) %>%
filter(INGREDIENT.x!=INGREDIENT.y) %>%
mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep=", "))
It works, but there is one final flaw: I would like the order to be ignored. For instance, this code gives me 1 association of HAHA and SASA, and also 1 association of SASA and HAHA. But for me, these are the same thing. So I would like the code to ignore the order of the INGREDIENTS and give me one unique association of 2 for HAHA & SASA.
I tried sorting the INGREDIENTS before applying the code, but it didn't work either. Could someone help me please? How can I get these combinations regardless of order?
Thank you very much!
Does this do what you want? I'm keeping only the rows where the two ingredients are in alphabetical order, so each unordered pair is counted once instead of twice.
data %>%
full_join(data, by="PRODUCT") %>%
filter(INGREDIENT.x < INGREDIENT.y) %>%
count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))
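If the goal is the most common combinations, one small extension of the snippet above is to let count() sort the result. This is only a sketch: stringsAsFactors = FALSE is my addition (on R older than 4.0 the columns would otherwise be factors, and < is not meaningful for unordered factors), and pair_counts is just an illustrative name.

```r
library(dplyr)

# Same data as in the question, written compactly
data <- data.frame(
  PRODUCT = rep(c("Creme", "Medoc", "Hububu", "Troll", "Suzuki", "Gluglu"),
                times = c(4, 5, 4, 4, 2, 3)),
  INGREDIENT = c("zeze", "zaza", "zozo", "zuzu",
                 "zaza", "sasa", "haha", "zuzu", "zemzem",
                 "zaza", "zuzu", "zizi", "haha",
                 "zozo", "zaza", "zemzem", "zuzu",
                 "sasa", "zuzu",
                 "ozam", "zaza", "hayda"),
  stringsAsFactors = FALSE
)

pair_counts <- data %>%
  # Self-join: every pair of ingredients sharing a PRODUCT.
  # Recent dplyr versions may warn about a many-to-many join here.
  full_join(data, by = "PRODUCT") %>%
  # Keep each unordered pair once (alphabetical order only)
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  # sort = TRUE puts the most frequent pairs first
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "), sort = TRUE)

head(pair_counts, 3)
```

With the example data, the first row should be "zaza, zuzu" with n = 4, since those two ingredients appear together in Creme, Medoc, Hububu and Troll.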
An igraph option using graph_from_adjacency_matrix
library(igraph)
get.data.frame(
graph_from_adjacency_matrix(
crossprod(table(data)),
mode = "undirected",
weighted = TRUE
)
)
gives
     from     to weight
1    haha   haha      2
2    haha   sasa      1
3    haha   zaza      2
4    haha zemzem      1
5    haha   zizi      1
6    haha   zuzu      2
7   hayda  hayda      1
8   hayda   ozam      1
9   hayda   zaza      1
10   ozam   ozam      1
11   ozam   zaza      1
12   sasa   sasa      2
13   sasa   zaza      1
14   sasa zemzem      1
15   sasa   zuzu      2
16   zaza   zaza      5
17   zaza zemzem      2
18   zaza   zeze      1
19   zaza   zizi      1
20   zaza   zozo      2
21   zaza   zuzu      4
22 zemzem zemzem      2
23 zemzem   zozo      1
24 zemzem   zuzu      2
25   zeze   zeze      1
26   zeze   zozo      1
27   zeze   zuzu      1
28   zizi   zizi      1
29   zizi   zuzu      1
30   zozo   zozo      2
31   zozo   zuzu      2
32   zuzu   zuzu      5
We could use base R
m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)
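For anyone wondering why crossprod(table(data)) works: table(data) is a PRODUCT by INGREDIENT matrix of 0/1 flags (assuming each ingredient is listed at most once per product, as in the example data), so t(M) %*% M counts, for each pair of ingredients, how many products contain both. The diagonal holds each ingredient's own product count, which is not really a pair. Below is a hedged variant that drops the diagonal and sorts by frequency. The names inc, co and pair_counts are only illustrative.

```r
# Same data as in the question, written compactly
data <- data.frame(
  PRODUCT = rep(c("Creme", "Medoc", "Hububu", "Troll", "Suzuki", "Gluglu"),
                times = c(4, 5, 4, 4, 2, 3)),
  INGREDIENT = c("zeze", "zaza", "zozo", "zuzu",
                 "zaza", "sasa", "haha", "zuzu", "zemzem",
                 "zaza", "zuzu", "zizi", "haha",
                 "zozo", "zaza", "zemzem", "zuzu",
                 "sasa", "zuzu",
                 "ozam", "zaza", "hayda"),
  stringsAsFactors = FALSE
)

inc <- table(data)     # rows: PRODUCT, columns: INGREDIENT, entries: 0/1
co  <- crossprod(inc)  # t(inc) %*% inc: co-occurrence counts per ingredient pair

# Strict lower triangle: each unordered pair once, diagonal excluded
idx <- which(lower.tri(co) & co > 0, arr.ind = TRUE)

pair_counts <- data.frame(
  first  = colnames(co)[idx[, 2]],
  second = rownames(co)[idx[, 1]],
  n      = co[idx],
  stringsAsFactors = FALSE
)

# Most common pairs first
pair_counts <- pair_counts[order(-pair_counts$n), ]
head(pair_counts, 3)
```

With the example data this leaves 22 pairs, led by zaza and zuzu with a count of 4, which matches row 21 of the igraph output above.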
EDIT: Comments from @ThomasIsCoding