R - match combinations of nested list values from an index and return value
Hi, I have two data sets. The first one is a list of genes linked to a given cluster (0-7):
# gene output
Cluster <- rep(0:7, each = 10)
Gene <- c("LMO3", "NEUROD6", "NFIB", "SNAP25", "RTN1", "CPE", "SOX11", "CSRP2", "VAMP2", "ID2", "EMX2", "LHX5-AS1","PEG10",
"HES1", "TRH", "WLS", "TPBG", "RPS29", "CRABP2", "RSPO3", "RPL17", "RPL7", "PTMA", "RPL36A", "HMGN2", "H2AFZ",
"NFIB", "PABPC1", "NEUROD6", "HNRNPH1", "PTN", "FABP7", "IGFBP2", "ID4", "C1orf61", "VIM", "RPS27L", "FABP5",
"SDCBP", "BNIP3", "TCF7L2", "NEFL", "HMGCS1", "GAP43", "GPM6A", "SQLE", "ID4", "MSMO1", "SCOC", "BASP1", "TTR",
"MEST", "TPBG", "MDK", "TMBIM6", "RCN1", "C8orf59","ID3","PKM", "PTN", "NCOR1", "ELAVL4", "NNAT", "ETFB",
"STMN2", "TUBA1A", "GNG3", "MALAT1", "SOX4", "TUBB2B", "CRYAB", "GFAP", "CHCHD2", "HOPX", "LGALS1", "SCRG1", "ISG15",
"AC090498.1", "B2M", "CLU")
df <- data.frame(cbind(Cluster, Gene))
The second is an index which provides cell-type annotations for specific combinations of genes:
# index
Type <- c("Radial Glia", "Excitatory Neuron ", "Inhibitory Neuron","Inhibitory Neuron",
"IPC","Excitatory Neuron ","Radial Glia","Microglia","IPC","Inhibitory Neuron")
Subtype <- c("early", "Layer IV", "SST-MGE1", "SST-MGE1", "IPC-div2",
"Parietal and Temporal", "oRG/Astrocyte", "Microglia", "IPC-new", "MGE2")
Markers <- c("TOP2A AURK HMGB CTNNB1", "PPP1R1B SCN2A RORB CRYM", "DLX6-AS1 DLX1 SST DCX", "ERBB4 SST DLX2 DLX5 DLX6-AS1",
"CCNB2 NEUROD4 KIF15 PENK HES6 ZFHX4 GLI3", "MEF2C STMN2 FLT ROBO CRYM", "AQP4 GFAP AGT DIO2 IL33",
"C1QB AIF1 CCL4 C1QC", "CENPK EOMES", "CCK LHX6 SCGN SST")
index <- data.frame(cbind(Type, Subtype, Markers))
I am trying to find the specific combinations outlined in Markers within the list of genes in my df. When such a match is found, the corresponding type and subtype should be returned. However, there are a couple of caveats that I am finding very difficult to wrap my head around.
My project data consists of dozens of df-like outputs made up of varying numbers of clusters, each containing hundreds to thousands of genes. I have tried my best to search for solutions online, but I am unfortunately drawing a total blank here.
Any help/thoughts/suggestions would be greatly appreciated.
Edit:
The output could look like this:
  Cluster    Gene        Type Subtype
1       0    LMO3 Radial Glia   early
2       0 NEUROD6        <NA>    <NA>
3       0    NFIB        <NA>    <NA>
4       0  SNAP25        <NA>    <NA>
5       0    RTN1        <NA>    <NA>
6       0     CPE        <NA>    <NA>
where each correct match would add a row to the df with the corresponding type and subtype for each cluster, leaving the remainder empty (NAs).
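There is probably a much simpler way of doing this, but here it is with a loop:
# Split each Markers string into a vector of individual gene symbols
marker_list = strsplit(as.character(index$Markers), " ", fixed = TRUE)
output = data.frame(Cluster=character(), Gene=character(), Type=character(), Subtype=character())
for(i in seq_len(nrow(df))){
  cluster = df[i,1]
  gene = as.character(df[i,2])
  # Compare whole symbols; grep(gene, index$Markers) would also match substrings
  # and would treat characters such as "." in a gene name as regex syntax
  hit = vapply(marker_list, function(m) gene %in% m, logical(1))
  type = index[hit,]
  n_types = nrow(type)
  # Genes without any hit contribute zero rows to the output
  tmp = data.frame(Cluster=rep(cluster,n_types),
                   Gene=rep(gene, n_types), Type=type[,1], Subtype=type[,2])
  output = rbind(output,tmp)
}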
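The same per-gene lookup can probably be written without a loop, as a left join in base R. The sketch below is only an illustration: `toy_df` and `toy_index` are hypothetical stand-ins with the same column layout as `df` and `index`, and the values in them are made up for the example. It assumes markers are separated by single spaces.

```r
# Hypothetical toy data with the same columns as df and index
toy_df <- data.frame(Cluster = c("0", "0", "1"),
                     Gene = c("GFAP", "LMO3", "SST"),
                     stringsAsFactors = FALSE)
toy_index <- data.frame(Type = c("Radial Glia", "Inhibitory Neuron"),
                        Subtype = c("oRG/Astrocyte", "MGE2"),
                        Markers = c("AQP4 GFAP AGT", "CCK LHX6 SCGN SST"),
                        stringsAsFactors = FALSE)

# Reshape the index to one row per (Type, Subtype, single marker gene)
marker_list <- strsplit(toy_index$Markers, " ", fixed = TRUE)
long_index <- data.frame(
  Type = rep(toy_index$Type, lengths(marker_list)),
  Subtype = rep(toy_index$Subtype, lengths(marker_list)),
  Gene = unlist(marker_list),
  stringsAsFactors = FALSE
)

# Left join: every row of toy_df is kept, genes without a marker hit get NA
annotated <- merge(toy_df, long_index, by = "Gene", all.x = TRUE)
annotated <- annotated[order(annotated$Cluster, annotated$Gene),
                       c("Cluster", "Gene", "Type", "Subtype")]
annotated
```

Because `all.x = TRUE` keeps unmatched genes with `NA` in Type and Subtype, the result has the shape shown in the question's edit. Like the loop, it matches single genes rather than complete marker combinations.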
I'm assuming you want to annotate each cluster of genes with a type from the index whenever all of the markers for that type are present in the cluster's pool of genes.
I'm also going to use some simplified datasets. Here are two simplified types in the index:
library(tidyverse)
index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C")),
)
index
#> # A tibble: 4 x 3
#> type subtype markers
#> <chr> <chr> <chr>
#> 1 AB X A
#> 2 AB X B
#> 3 BC Y B
#> 4 BC Y C
And three different clusters that illustrate different matching scenarios:
clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")), # 2 matches
  tibble(cluster = 1, genes = c("B", "C", "D")), # 1 match
  tibble(cluster = 2, genes = c("C", "D", "E")), # No matches
)
clusters
#> # A tibble: 9 x 2
#> cluster genes
#> <dbl> <chr>
#> 1 0 A
#> 2 0 B
#> 3 0 C
#> 4 1 B
#> 5 1 C
#> 6 1 D
#> 7 2 C
#> 8 2 D
#> 9 2 E
I would approach this by first writing a function that returns the matching types for a given pool of genes:
match_index <- function(genes) {
  matches <- index %>%
    group_by(type, subtype) %>%
    filter(all(markers %in% genes)) %>%
    distinct(type, subtype)
  # If none matched, return a row of NAs
  if (nrow(matches)) matches else matches[NA_integer_, ]
}
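The two ideas the function relies on can be tried in isolation. This is a minimal base R sketch, not part of the answer's code: the vectors and the `empty` data frame are made up for illustration, and a plain data frame stands in for the tibble.

```r
# A type matches only when every one of its markers is in the gene pool
pool <- c("B", "C", "D")
ab_matches <- all(c("A", "B") %in% pool)   # "A" is missing from the pool
bc_matches <- all(c("B", "C") %in% pool)   # both markers are present

# Indexing a zero-row data frame with NA yields a single row of NAs,
# which is how a cluster with no matches still gets one output row
empty <- data.frame(type = character(), subtype = character(),
                    stringsAsFactors = FALSE)
na_row <- empty[NA_integer_, ]
nrow(na_row)
```

Here `ab_matches` is `FALSE`, `bc_matches` is `TRUE`, and `na_row` has exactly one row.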
Then you can summarise each cluster with the function:
clusters %>%
  group_by(cluster) %>%
  summarise(match_index(genes))
#> `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
#> # A tibble: 4 x 3
#> # Groups: cluster [3]
#> cluster type subtype
#> <dbl> <chr> <chr>
#> 1 0 AB X
#> 2 0 BC Y
#> 3 1 BC Y
#> 4 2 <NA> <NA>
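To use the same idea on an index laid out like the one in the question, with all markers packed into one space-separated string, the strings first need to be split into one row per marker. The sketch below is one possible way to do that, under stated assumptions: `index_wide`, `clusters_toy` and `match_index_long()` are hypothetical names, the data is a made-up subset, and `group_modify()` is used in place of `summarise()` for the step that returns a variable number of rows per group.

```r
library(dplyr)
library(tidyr)

# Hypothetical index in the question's layout: one string of markers per type
index_wide <- tibble(
  Type = c("Radial Glia", "IPC"),
  Subtype = c("oRG/Astrocyte", "IPC-new"),
  Markers = c("AQP4 GFAP", "CENPK EOMES")
)

# One row per individual marker gene
index_long <- index_wide %>%
  separate_rows(Markers, sep = " ")

# Hypothetical clusters: cluster 0 holds both Radial Glia markers,
# cluster 1 holds only one of the two IPC markers
clusters_toy <- tibble(
  Cluster = c(0, 0, 0, 1, 1),
  Gene = c("AQP4", "GFAP", "VIM", "CENPK", "VIM")
)

# Same logic as match_index(), with the index passed in as an argument
match_index_long <- function(genes, idx) {
  matches <- idx %>%
    group_by(Type, Subtype) %>%
    filter(all(Markers %in% genes)) %>%
    ungroup() %>%
    distinct(Type, Subtype)
  # If none matched, return a single row of NAs
  if (nrow(matches) > 0) {
    matches
  } else {
    tibble(Type = NA_character_, Subtype = NA_character_)
  }
}

result <- clusters_toy %>%
  group_by(Cluster) %>%
  group_modify(~ match_index_long(.x$Gene, index_long)) %>%
  ungroup()
result
```

With this toy data, cluster 0 is annotated as Radial Glia (oRG/Astrocyte), and cluster 1 gets a row of `NA`s because EOMES is absent.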