R - Matching combinations of nested list values from an index and returning values

Question

Hi, I have two data sets. The first is a list of genes linked to a given cluster (0-7):

# gene output

Cluster <- rep(0:7, each = 10)

Gene <- c("LMO3", "NEUROD6", "NFIB", "SNAP25", "RTN1", "CPE", "SOX11", "CSRP2", "VAMP2", "ID2", "EMX2", "LHX5-AS1","PEG10",
          "HES1", "TRH", "WLS", "TPBG", "RPS29", "CRABP2", "RSPO3", "RPL17", "RPL7", "PTMA", "RPL36A", "HMGN2", "H2AFZ",
          "NFIB", "PABPC1", "NEUROD6", "HNRNPH1", "PTN", "FABP7", "IGFBP2", "ID4", "C1orf61", "VIM", "RPS27L", "FABP5",
          "SDCBP", "BNIP3", "TCF7L2", "NEFL", "HMGCS1", "GAP43", "GPM6A", "SQLE", "ID4", "MSMO1", "SCOC", "BASP1", "TTR",
          "MEST", "TPBG", "MDK", "TMBIM6", "RCN1", "C8orf59","ID3","PKM", "PTN", "NCOR1", "ELAVL4", "NNAT", "ETFB",
          "STMN2", "TUBA1A", "GNG3", "MALAT1", "SOX4", "TUBB2B", "CRYAB", "GFAP", "CHCHD2", "HOPX", "LGALS1", "SCRG1", "ISG15",
          "AC090498.1", "B2M", "CLU")

df <- data.frame(cbind(Cluster, Gene))

The second is an index that provides cell-type annotations for specific combinations of genes:

# index

Type <- c("Radial Glia", "Excitatory Neuron ", "Inhibitory Neuron","Inhibitory Neuron",
          "IPC","Excitatory Neuron ","Radial Glia","Microglia","IPC","Inhibitory Neuron")

Subtype <- c("early", "Layer IV", "SST-MGE1", "SST-MGE1", "IPC-div2", 
             "Parietal and Temporal", "oRG/Astrocyte", "Microglia", "IPC-new", "MGE2")

Markers <- c("TOP2A AURK HMGB CTNNB1", "PPP1R1B SCN2A RORB CRYM", "DLX6-AS1 DLX1 SST DCX", "ERBB4 SST DLX2 DLX5 DLX6-AS1",
             "CCNB2 NEUROD4 KIF15 PENK HES6 ZFHX4 GLI3", "MEF2C STMN2 FLT ROBO CRYM", "AQP4 GFAP AGT DIO2 IL33",
             "C1QB AIF1 CCL4 C1QC", "CENPK EOMES", "CCK LHX6 SCGN SST")

index <- data.frame(cbind(Type, Subtype, Markers))

I am trying to find the specific combinations outlined in Markers within the list of genes in my df. When such a match is found, the corresponding type and subtype should be returned. However, there are a couple of caveats that I am finding very difficult to wrap my head around.

The list for each cluster may contain multiple marker combinations, so the function should go over each marker combination iteratively rather than stop when the first match is found.
The index-matching process should operate on each cluster separately, i.e. check the genes in cluster 0 for marker matches and return the type/subtype(s), then repeat the steps for cluster 1, and so on.

My project data consists of dozens of df-like outputs, each with a different number of clusters, and each cluster contains hundreds to thousands of genes. I have tried my best to search for solutions online, but unfortunately I am drawing a total blank here.

Any help/thoughts/suggestions would be greatly appreciated.

Edit:

The output could look like so:

  Cluster    Gene        Type Subtype
1       0    LMO3 Radial Glia   early
2       0 NEUROD6        <NA>    <NA>
3       0    NFIB        <NA>    <NA>
4       0  SNAP25        <NA>    <NA>
5       0    RTN1        <NA>    <NA>
6       0     CPE        <NA>    <NA>

where each correct match would add a row to the df with the corresponding type and subtype for each cluster, leaving the remainder empty (NAs).

Answer 1

There is probably a much simpler way of doing this, but here it is with a loop:

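# Split each Markers string into a vector of individual gene symbols
marker_sets = strsplit(index$Markers, " ", fixed = TRUE)

output = data.frame(Cluster=character(), Gene=character(), Type=character(), Subtype=character())

for(i in seq_len(nrow(df))){
  cluster = df[i,1]
  gene = df[i,2]
  # Exact symbol match; grep(gene, index$Markers) would also hit substrings
  # and would treat "." in a gene name as a regex wildcard
  hits = vapply(marker_sets, function(m) gene %in% m, logical(1))
  type = index[hits,]
  n_types = nrow(type)
  # Genes with no hit give n_types == 0 and therefore add no rows
  tmp = data.frame(Cluster=rep(cluster,n_types),
                   Gene=rep(gene, n_types), Type=type[,1], Subtype=type[,2])
  output = rbind(output,tmp)
}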
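
If the unmatched genes should stay in the result with NAs, as in the question's edit, one possible loop-free variant is to expand the index to one marker per row and left-join it onto the gene list. The following is only a minimal base R sketch: `index_long` and `annotated` are names made up here, and the small `df` and `index` are stand-ins built from a few of the question's values so the block runs on its own. Like the loop above, it matches gene by gene and does not require a full marker combination to be present.

```r
# Small stand-ins for the question's df and index (assumed shapes, not the real data)
df <- data.frame(Cluster = c(0, 0, 1), Gene = c("GFAP", "LMO3", "STMN2"))
index <- data.frame(
  Type    = c("Radial Glia", "Excitatory Neuron"),
  Subtype = c("oRG/Astrocyte", "Parietal and Temporal"),
  Markers = c("AQP4 GFAP AGT DIO2 IL33", "MEF2C STMN2 FLT ROBO CRYM")
)

# Expand the index to one row per individual marker gene
marker_sets <- strsplit(index$Markers, " ", fixed = TRUE)
index_long <- data.frame(
  Type    = rep(index$Type, lengths(marker_sets)),
  Subtype = rep(index$Subtype, lengths(marker_sets)),
  Gene    = unlist(marker_sets)
)

# Left join: every row of df is kept, genes without a marker hit get NA
annotated <- merge(df, index_long, by = "Gene", all.x = TRUE)
annotated <- annotated[order(annotated$Cluster, annotated$Gene),
                       c("Cluster", "Gene", "Type", "Subtype")]
annotated
```

With dozens of large outputs this avoids growing a data frame with rbind inside a loop, which tends to get slow as the number of rows increases.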

Answer 2

I'm assuming you want to annotate each cluster of genes with the types from the index whenever all of the markers for a type are present in the cluster's pool of genes.

I'm also going to use some simplified datasets. First, two simplified types in the index:

library(tidyverse)

index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C")),
)

index
#> # A tibble: 4 x 3
#>   type  subtype markers
#>   <chr> <chr>   <chr>  
#> 1 AB    X       A      
#> 2 AB    X       B      
#> 3 BC    Y       B      
#> 4 BC    Y       C
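
The index in the question stores all markers for a type in one space-separated string, whereas the simplified index here has one marker per row. A possible way to get from one layout to the other is `tidyr::separate_rows()`. This is a sketch, not part of the original answer: `index_wide` and `index_long` are names chosen for illustration, and only two rows of the question's index are copied in.

```r
library(tidyverse)

# Two rows copied from the question's index, in its original wide layout
index_wide <- tibble(
  Type    = c("IPC", "Microglia"),
  Subtype = c("IPC-new", "Microglia"),
  Markers = c("CENPK EOMES", "C1QB AIF1 CCL4 C1QC")
)

# One row per marker, with lower-case column names to match the simplified index
index_long <- index_wide %>%
  separate_rows(Markers, sep = " ") %>%
  rename(type = Type, subtype = Subtype, markers = Markers)

index_long
```

The same reshaping would be needed for the real index before it can be used with a per-marker check such as `markers %in% genes`.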

And three different clusters that illustrate different matching scenarios:

clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")), # 2 matches
  tibble(cluster = 1, genes = c("B", "C", "D")), # 1 match
  tibble(cluster = 2, genes = c("C", "D", "E")), # No matches
)

clusters
#> # A tibble: 9 x 2
#>   cluster genes
#>     <dbl> <chr>
#> 1       0 A    
#> 2       0 B    
#> 3       0 C    
#> 4       1 B    
#> 5       1 C    
#> 6       1 D    
#> 7       2 C    
#> 8       2 D    
#> 9       2 E

I would approach this by first making a function that returns the matching types for a given pool of genes:

match_index <- function(genes) {
  matches <- index %>% 
    group_by(type, subtype) %>% 
    filter(all(markers %in% genes)) %>% 
    distinct(type, subtype)

  # If none matched, return a row of NAs  
  if (nrow(matches)) matches else matches[NA_integer_, ]
}

Then you can just summarise each cluster with the function:

clusters %>% 
  group_by(cluster) %>% 
  summarise(match_index(genes))
#> `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
#> # A tibble: 4 x 3
#> # Groups:   cluster [3]
#>   cluster type  subtype
#>     <dbl> <chr> <chr>  
#> 1       0 AB    X      
#> 2       0 BC    Y      
#> 3       1 BC    Y      
#> 4       2 <NA>  <NA>
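
Since dplyr 1.1.0, returning more than one row per group from `summarise()` is deprecated in favour of `reframe()`. A self-contained sketch of the same idea with `reframe()`, assuming dplyr 1.1.0 or later, could look like the following. The `ungroup()` call and the `result` name are additions made here and are not part of the original answer.

```r
library(dplyr)

# Same simplified data as above
index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C"))
)
clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")),
  tibble(cluster = 1, genes = c("B", "C", "D")),
  tibble(cluster = 2, genes = c("C", "D", "E"))
)

match_index <- function(genes) {
  matches <- index %>%
    group_by(type, subtype) %>%
    filter(all(markers %in% genes)) %>%  # keep types whose markers are all present
    ungroup() %>%
    distinct(type, subtype)

  # If none matched, return a single row of NAs
  if (nrow(matches)) matches else matches[NA_integer_, ]
}

# reframe() allows any number of rows per group and returns an ungrouped tibble
result <- clusters %>%
  group_by(cluster) %>%
  reframe(match_index(genes))

result
```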

Answer 1
0 2020-11-12 16:54:45

Answer 2
0 2020-11-12 17:35:21