Faster processing alternative to a for loop
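I am using a for loop to replace the values in xen.biomart$chromosome_name (column 3) that match chr.alias.biomart$ensembl with the value of chr.alias.biomart$ucsc from the same chr.alias.biomart row. It works, but it takes far too long (20+ minutes). Is there a faster alternative?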
for(i in 1:nrow(xen.biomart)){
  for(x in 1:nrow(chr.alias.biomart)){
    xen.biomart[i,3][xen.biomart[i,3] == chr.alias.biomart$ensembl[x]] <- chr.alias.biomart$ucsc[x]
  }
}
xen.biomart has 146816 rows. chr.alias.biomart has 46 rows and holds the reference values I want to substitute in.
> head(xen.biomart[,3])
[1] "MT" "MT" "MT" "MT" "MT" "MT"
> head(chr.alias.biomart)
  ensembl  ucsc assembly    genbank      refseq
1       1  chr1     Chr1 CM004443.2 NC_030677.2
2      10 chr10    Chr10 CM004452.2 NC_030686.2
3       2  chr2     Chr2 CM004444.2 NC_030678.2
4       3  chr3     Chr3 CM004445.2 NC_030679.2
5       4  chr4     Chr4 CM004446.2 NC_030680.2
6       5  chr5     Chr5 CM004447.2 NC_030681.2
> dput(xen.biomart[c(1,1000,10000,15000), ])
structure(list(ensembl_gene_id = c("ENSXETG00000034356", "ENSXETG00000034782",
"ENSXETG00000029203", "ENSXETG00000021054"), external_gene_name = c("",
"", "xtr-mir-144", "cdk2ap2"), chromosome_name = c("MT", "1",
"2", "3"), start_position = c(1L, 122943147L, 34088294L, 148518850L
), end_position = c(68L, 122971793L, 34088355L, 148548901L),
description = c("", "", "xtr-mir-144 [Source:miRBase;Acc:MI0004938]",
"claudin 15, 1 [Source:Xenbase;Acc:XB-GENE-994817]")), row.names = c(1L,
1000L, 10000L, 15000L), class = "data.frame")
> dput(chr.alias.biomart[c(1:10,46),])
structure(list(ensembl = c("1", "10", "2", "3", "4", "5", "6",
"7", "8", "9", "MT"), ucsc = c("chr1", "chr10", "chr2", "chr3",
"chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chrM"), assembly = c("Chr1",
"Chr10", "Chr2", "Chr3", "Chr4", "Chr5", "Chr6", "Chr7", "Chr8",
"Chr9", "MT"), genbank = c("CM004443.2", "CM004452.2", "CM004444.2",
"CM004445.2", "CM004446.2", "CM004447.2", "CM004448.2", "CM004449.2",
"CM004450.2", "CM004451.2", "MT"), refseq = c("NC_030677.2",
"NC_030686.2", "NC_030678.2", "NC_030679.2", "NC_030680.2", "NC_030681.2",
"NC_030682.2", "NC_030683.2", "NC_030684.2", "NC_030685.2", "NC_006839.1"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 46L
), class = "data.frame")
This should be much faster. If you need even more speed, use the data.table package.
library(dplyr)
xen.biomart = xen.biomart %>%
  ## join the relevant alias column into xen biomart
  left_join(
    select(chr.alias.biomart, ensembl, ucsc),
    by = c("chromosome_name" = "ensembl")
  ) %>%
  ## replace all chromosome_names with ucsc value (if not NA)
  mutate(chromosome_name = coalesce(ucsc, chromosome_name)) %>%
  ## drop ucsc columns
  select(-ucsc)
#      ensembl_gene_id external_gene_name chromosome_name start_position end_position
# 1 ENSXETG00000034356                               chrM              1           68
# 2 ENSXETG00000034782                               chr1      122943147    122971793
# 3 ENSXETG00000029203        xtr-mir-144            chr2       34088294     34088355
# 4 ENSXETG00000021054            cdk2ap2            chr3      148518850    148548901
#                                         description
# 1
# 2
# 3        xtr-mir-144 [Source:miRBase;Acc:MI0004938]
# 4 claudin 15, 1 [Source:Xenbase;Acc:XB-GENE-994817]
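For the data.table route mentioned above, one possible approach is an update join, which overwrites the matching values in place without building an intermediate joined table. This is only a sketch under assumptions: the small xen and alias tables below are stand-ins for xen.biomart and chr.alias.biomart, and the "scaffold_1" row is a hypothetical unmatched value added to show that names without an alias are left alone.

```r
library(data.table)

# Stand-in for xen.biomart (only the columns needed for the illustration).
# "scaffold_1" is a hypothetical chromosome name with no entry in the alias table.
xen <- data.table(
  ensembl_gene_id = c("ENSXETG00000034356", "ENSXETG00000034782",
                      "ENSXETG00000029203", "ENSXETG00000021054",
                      "hypothetical_gene"),
  chromosome_name = c("MT", "1", "2", "3", "scaffold_1")
)

# Stand-in for chr.alias.biomart (ensembl -> ucsc lookup).
alias <- data.table(
  ensembl = c("1", "2", "3", "MT"),
  ucsc    = c("chr1", "chr2", "chr3", "chrM")
)

# Update join: for every row of xen whose chromosome_name equals alias$ensembl,
# overwrite chromosome_name by reference with the matching ucsc value (i.ucsc).
# Rows of xen without a match in alias are not touched.
xen[alias, on = .(chromosome_name = ensembl), chromosome_name := i.ucsc]

print(xen)
```

On the real data, a data.frame would first need converting with as.data.table() (or setDT() to convert in place). Because := modifies the table by reference, there is no need to assign the result back.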