Finding the number of combinations

Question

I have a dataset where each row corresponds to a sample that was tested for the presence of specific drugs (one sample can contain more than one drug). I am trying to find the most common drug combinations, and I would like to know whether there is a better way to do it. This is an example of my dataset:

    library(tidyverse)

    # sample ids must be quoted strings
    id = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
    d1 = c(1,1,0,1,0,1,0,1)
    d2 = c(0,0,1,0,1,1,1,0)
    d3 = c(1,0,1,1,0,1,0,1)

    df = tibble(id, d1, d2, d3)

The column id holds the id of the sample, and the other columns are the drugs for which each sample was tested (the original dataset has 42 drugs/columns). 1 means Yes, 0 means No.

To get the count for each combination, I did the following:

df %>% unite("tot", d1:d3, sep = "-", remove = FALSE) %>%
  group_by(tot) %>% summarise(n = n())

# A tibble: 5 x 2
  tot       n
  <chr> <int>
1 0-1-0     2
2 0-1-1     1
3 1-0-0     1
4 1-0-1     3
5 1-1-1     1

OK, now I know that the combination 1-0-1 (d1 + d3) is the most common. That is relatively simple, considering that the example has only 3 drugs. The problem is that when I do this for all 42 drugs, I end up with a huge string that I need to translate back into drug names.

Is there a more efficient way to do this? Thanks!

Answer 1

Using dplyr, you can do:

df %>%
 group_by_at(vars(-id)) %>%
 count()

     d1    d2    d3     n
  <dbl> <dbl> <dbl> <int>
1     0     1     0     2
2     0     1     1     1
3     1     0     0     1
4     1     0     1     3
5     1     1     1     1
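
As a side note, `group_by_at()` is superseded in dplyr 1.0.0 and later. The following is a minimal sketch of the same count written with `across()`, assuming a recent dplyr is installed; the name `combo_counts` is only illustrative, and the data is the question's example rebuilt so the block runs on its own.

```r
library(dplyr)

# Example data from the question, rebuilt so the block is self-contained
df <- tibble::tibble(
  id = paste0("id", 1:8),
  d1 = c(1, 1, 0, 1, 0, 1, 0, 1),
  d2 = c(0, 0, 1, 0, 1, 1, 1, 0),
  d3 = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Count every distinct combination of the non-id columns,
# with the most frequent combination first
combo_counts <- df %>%
  count(across(-id), sort = TRUE)

combo_counts
```

Because the drug columns are selected as "everything except id", the same call should work unchanged for the 42-column dataset.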

If you want the names of the columns that contain a 1 in the k (here two) most frequent combinations, you can add tidyr:

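df %>%
 group_by_at(vars(-id)) %>%        # group by every drug column
 count() %>%                       # n = number of samples per combination
 ungroup() %>%
 top_n(2, wt = n) %>%              # keep the two most frequent combinations (ties included)
 rowid_to_column() %>%             # give each remaining combination a row id
 pivot_longer(-c(rowid, n)) %>%    # one row per combination/drug pair
 group_by(rowid, n) %>%
 summarise(name = paste(name[value == 1], collapse = ", "))  # names of the drugs present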

  rowid     n name  
  <int> <int> <chr> 
1     1     2 d2    
2     2     3 d1, d3

Answer 2

An additional option:

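df %>% 
  pivot_longer(-id) %>%                               # one row per sample/drug pair
  filter(value != 0) %>%                              # keep only the drugs that were found
  group_by(id) %>% 
  summarise(name = str_c(name, collapse = ", ")) %>%  # combination label per sample
  group_by(name) %>% 
  count() %>%                                         # number of samples per combination
  arrange(-n)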
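
Note that the `filter(value != 0)` step drops any sample in which no drug was found, so such samples do not appear in the counts. The sketch below is one possible variant that keeps them, under the assumption that every column other than `id` is a 0/1 drug indicator. The names `drug_cols`, `is_present`, `combo`, and `combo_counts` are illustrative and not part of the original answers.

```r
library(dplyr)

# Example data from the question, rebuilt so the block is self-contained
df <- tibble::tibble(
  id = paste0("id", 1:8),
  d1 = c(1, 1, 0, 1, 0, 1, 0, 1),
  d2 = c(0, 0, 1, 0, 1, 1, 1, 0),
  d3 = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Assumption: every column except `id` is a 0/1 drug indicator
drug_cols <- setdiff(names(df), "id")

# Logical matrix: TRUE where a drug was found in a sample
is_present <- as.matrix(df[drug_cols]) == 1

# Build a readable label per sample, e.g. "d1, d3";
# a sample with no drugs gets the empty string ""
df$combo <- apply(is_present, 1, function(x) paste(drug_cols[x], collapse = ", "))

# Count samples per label, most frequent first
combo_counts <- count(df, combo, sort = TRUE)

combo_counts
```

The labels are built from the column names directly, so nothing has to be translated back from a 0/1 string, however many drug columns there are.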

2 answers

Answer 1
Score 3, accepted, 2020-03-17 16:44:16

Answer 2
Score 1, 2020-03-17 17:36:34
