
Find number of combinations

I have a dataset where each row corresponds to a sample that was tested for the presence of specific drugs (one sample can contain more than one drug). I am trying to find the most common drug combinations, and I would like to know if there is a better way to do it. This is an example of my dataset:

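    library(tidyverse)  # tibble, dplyr, tidyr and stringr are used below

    # Sample ids must be quoted strings
    id = c("id1","id2","id3","id4","id5","id6","id7","id8")
    d1 = c(1,1,0,1,0,1,0,1)
    d2 = c(0,0,1,0,1,1,1,0)
    d3 = c(1,0,1,1,0,1,0,1)

    df = tibble(id, d1, d2, d3)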

Column id corresponds to the id of the sample, and the other columns are the drugs each sample was tested for (in the original dataset I have 42 drugs/columns). 1 means Yes, 0 means No.

To get the count for each combination, I did the following:

df %>% unite("tot", d1:d3, sep = "-", remove = F) %>%
  group_by(tot) %>% summarise(n = n())

# A tibble: 5 x 2
  tot       n
  <chr> <int>
1 0-1-0     2
2 0-1-1     1
3 1-0-0     1
4 1-0-1     3
5 1-1-1     1

OK, now I know that combination 1-0-1 (d1 + d3) is the most common. That is relatively simple, taking into account that the example has only 3 drugs. The problem comes when I do it for all 42 drugs and end up with a huge string that I need to translate back.

Is there a more efficient way to do this? Thanks!

Using dplyr, you can do:

df %>%
 group_by_at(vars(-id)) %>%
 count()

     d1    d2    d3     n
  <dbl> <dbl> <dbl> <int>
1     0     1     0     2
2     0     1     1     1
3     1     0     0     1
4     1     0     1     3
5     1     1     1     1
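In recent dplyr releases (1.0.0 and later), the `_at` scoped verbs are superseded by across(). As a rough sketch, assuming such a version is installed, the same grouping could be written as below; the name combo_counts is only illustrative, and the data is rebuilt so the snippet runs on its own:

```r
library(dplyr)
library(tibble)

# Example data from the question, rebuilt so the snippet is self-contained
df <- tibble(
  id = paste0("id", 1:8),
  d1 = c(1, 1, 0, 1, 0, 1, 0, 1),
  d2 = c(0, 0, 1, 0, 1, 1, 1, 0),
  d3 = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Group by every column except id, count the rows in each group,
# and put the most frequent combination first
combo_counts <- df %>%
  group_by(across(-id)) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

combo_counts
```

Because the selection is "everything except id", this should carry over to the 42-column dataset without listing the drug columns by name.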

If you want the names of the columns that contain ones for the n (here two) most frequent combinations, you can add tidyr:

df %>%
 group_by_at(vars(-id)) %>%
 count() %>%
 ungroup() %>%
 top_n(2, wt = n) %>%
 rowid_to_column() %>%
 pivot_longer(-c(rowid, n)) %>%
 group_by(rowid, n) %>%
 summarise(name = paste(name[value == 1], collapse = ", "))

  rowid     n name  
  <int> <int> <chr> 
1     1     2 d2    
2     2     3 d1, d3
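top_n() is likewise superseded in dplyr 1.0.0 and later, where slice_max() is the suggested replacement. A possible variant of the pipeline above, assuming such a version, is sketched here; top_combos is an illustrative name. Note that slice_max() returns rows in descending order, so the most frequent combination gets rowid 1:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Example data from the question, rebuilt so the snippet is self-contained
df <- tibble(
  id = paste0("id", 1:8),
  d1 = c(1, 1, 0, 1, 0, 1, 0, 1),
  d2 = c(0, 0, 1, 0, 1, 1, 1, 0),
  d3 = c(1, 0, 1, 1, 0, 1, 0, 1)
)

top_combos <- df %>%
  group_by(across(-id)) %>%
  summarise(n = n(), .groups = "drop") %>%
  # Keep the two largest counts (ties are kept by default)
  slice_max(n, n = 2) %>%
  rowid_to_column() %>%
  # One row per (combination, drug) pair
  pivot_longer(-c(rowid, n)) %>%
  group_by(rowid, n) %>%
  # Keep only the names of the drugs that are present
  summarise(name = paste(name[value == 1], collapse = ", "), .groups = "drop")

top_combos
```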

An additional option:

df %>% 
  pivot_longer(-id) %>% 
  filter(value != 0) %>% 
  group_by(id) %>% 
  summarise(name = str_c(name, collapse = ", ")) %>% 
  group_by(name) %>% 
  count() %>% 
  arrange(-n)
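For comparison, the same idea (label each sample with the names of its drugs, then count the labels) can be sketched in base R without any packages. This is only one possible way to do it; drug_cols, combo and the "(none)" label are names chosen for illustration. Unlike the pipeline above, which drops samples with no drugs at the filter step, this version keeps them under "(none)":

```r
# Example data from the question, as a plain data.frame
df <- data.frame(
  id = paste0("id", 1:8),
  d1 = c(1, 1, 0, 1, 0, 1, 0, 1),
  d2 = c(0, 0, 1, 0, 1, 1, 1, 0),
  d3 = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Every column except id is treated as a drug indicator
drug_cols <- setdiff(names(df), "id")

# For each row, join the names of the columns that hold a 1
combo <- apply(df[drug_cols] == 1, 1, function(present) {
  if (any(present)) paste(drug_cols[present], collapse = ", ") else "(none)"
})

# Count each label and sort from most to least frequent
combo_counts <- sort(table(combo), decreasing = TRUE)

combo_counts
```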
