R: extract genes from Biostrings format within a list
I have a list of sequences:
class(myseqs)
[1] "list"
Each element of the list holds multiple sequences stored in a Biostrings class. In this example, the list myseqs contains 6 samples:
names(myseqs)
[1] "Sample-01" "Sample-02" "Sample-03" "Sample-04" "Sample-05" "Sample-06"
and each sample is an AAStringSet from the Biostrings package:
class(myseqs[["Sample-01"]])
[1] "AAStringSet"
attr(,"package")
[1] "Biostrings"
myseqs[["Sample-01"]]
AAStringSet object of length 143:
width seq names
[1] 453 MISIIKRFLGKRQPRQSAEHHYEFLPAHLALAQKPPSPFARLTAITLSIGVLAVLLWAY...VFPAQVQLNKNNIVIDGQTVELTPGMSVVAEIKTDKRRVIDYLLSPIREYQAEALRER IMEHDJCA_00190
[2] 701 MSNNEGLTLICIHFYLSIISGNREQFKENANLTKTNNYKEELKKIQKANKVRITTKHSQ...VITIAHRLSTVRDCNRIIVLHQGAIVEQGSHQQLLTHGKQYKQLWQLQQELKQEETTA IMEHDJCA_00191
... ... ...
[142] 275 MTAITISDQEYRDFSRFLESQCGIVLGDSKQYLVRSRLSPLVTKFKLASLSDLLRDVVT...RNVLIYFSPDMKSKVLNQMANSLNPGGYLLLGASESLTGLTDRFEMVRCNPGIIYKLK IMEHDJCA_03929
[143] 172 MPLLDSFTVDHTRMHAPAVRVAKTMQTPKGDTITVFDLRFTAPNKDILSEKGIHTLEHL...ESQNKIPELNEYQCGTAAMHSLDEAKQIAQNILAVGISVNRNDELALPEAMLKELKVD IMEHDJCA_04048
I want to use a data.frame to extract a specific gene from each sample:
head(df)
qseqid samples gene
1 IMEHDJCA_02683 Sample-01 pilB
2 DIBHEKPI_01114 Sample-02 pilB
3 LLMDBGDK_00899 Sample-03 pilB
4 EBMGLGMO_01529 Sample-04 pilB
5 ILCJGNBA_00973 Sample-05 pilB
6 JAGNDBFC_01143 Sample-06 pilB
I extract each sample's gene with a for loop over the qseqid column:
genes <- c()
for(i in 1:nrow(df)){
sample <- df[i,2]
qseq_id <- df[i,1]
seq <- myseqs[[sample]]
genes[[i]] <- Biostrings::AAStringSet(seq[qseq_id])
}
My problem is that the genes variable ends up as a list:
[[1]]
AAStringSet object of length 1:
width seq names
[1] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFDEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL IMEHDJCA_02683
[[2]]
AAStringSet object of length 1:
width seq names
[1] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFDEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL DIBHEKPI_01114
[[3]]
AAStringSet object of length 1:
width seq names
[1] 561 MTNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEQLSAIFGLPC...YEVMPFDEQLAEAIVKGASVQSLEMLAQQKGMMTLKDSGLEKLKQGITSLEELQRVLYL LLMDBGDK_00899
[[4]]
AAStringSet object of length 1:
width seq names
[1] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFNEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL EBMGLGMO_01529
[[5]]
AAStringSet object of length 1:
width seq names
[1] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFNEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL ILCJGNBA_00973
[[6]]
AAStringSet object of length 1:
width seq names
[1] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFNEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL JAGNDBFC_01143
I just want a single Biostrings object containing the gene from all 6 samples, for example:
genes
AAStringSet object of length 6:
width seq names
[1] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFDEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL IMEHDJCA_02683
[2] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFDEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL DIBHEKPI_01114
[3] 561 MTNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEQLSAIFGLPC...YEVMPFDEQLAEAIVKGASVQSLEMLAQQKGMMTLKDSGLEKLKQGITSLEELQRVLYL LLMDBGDK_00899
[4] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFNEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL EBMGLGMO_01529
[5] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFNEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL ILCJGNBA_00973
[6] 562 MQSNLATILRQANQLSLTQEQACRETIQASGVTAPEALLQLGFFQPDELTEKLSAIFGLP...YEVMPFNEQLAEAIVKGASVQSLEMLARQKGMMTLKDSGLEKLKQGITSLEELQRVLYL JAGNDBFC_01143
I tried assigning to genes without [[i]]:
genes <- Biostrings::AAStringSet(seq[qseq_id])
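but that keeps only the last sequence, because each iteration overwrites the previous value.
Doing it by hand would look something like: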
S01 <- Biostrings::AAStringSet(myseqs[["Sample-01"]]["IMEHDJCA_02683"])
S02 <- Biostrings::AAStringSet(myseqs[["Sample-02"]]["DIBHEKPI_01114"])
and so on for Sample-03 to Sample-n, then build the Biostrings object:
genes <- c(S01, S02, S03, S04, S05, S06)
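Does anyone know how to do this?
Many thanks!
You can unlist each AAStringSet object inside the loop and then build a single AAStringSet from the resulting list, e.g.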
library(Biostrings)
a <- AAString("MISIIKRFLGKRQPRQSAEHHYEFLPAHLALAQKPPSPFARLTAITLSIGVLAVLLWAYVFPAQVQLNKNNIVIDGQTVELTPGMSVVAEIKTDKRRVIDYLLSPIREYQAEALRER")
b <- AAString("MTAITISDQEYRDFSRFLESQCGIVLGDSKQYLVRSRLSPLVTKFKLASLSDLLRDVVTRNVLIYFSPDMKSKVLNQMANSLNPGGYLLLGASESLTGLTDRFEMVRCNPGIIYKLK")
myseqs <- list("Sample-01" = AAStringSet(c("IMEHDJCA_02683" = a)),
"Sample-02" = AAStringSet(c("DIBHEKPI_01114" = b)))
class(myseqs)
#> [1] "list"
names(myseqs)
#> [1] "Sample-01" "Sample-02"
myseqs[["Sample-01"]]
#> AAStringSet object of length 1:
#> width seq names
#> [1] 117 MISIIKRFLGKRQPRQSAEHHYE...KRRVIDYLLSPIREYQAEALRER IMEHDJCA_02683
class(myseqs[["Sample-01"]])
#> [1] "AAStringSet"
#> attr(,"package")
#> [1] "Biostrings"
df <- read.table(text = " qseqid samples gene
1 IMEHDJCA_02683 Sample-01 pilB
2 DIBHEKPI_01114 Sample-02 pilB", header = TRUE)
df
#> qseqid samples gene
#> 1 IMEHDJCA_02683 Sample-01 pilB
#> 2 DIBHEKPI_01114 Sample-02 pilB
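# Collect one AAString per row of df
genes <- list()
for (i in seq_len(nrow(df))) {
  sample <- df[i, 2]    # sample name, e.g. "Sample-01"
  qseq_id <- df[i, 1]   # sequence ID to pull from that sample
  seq <- myseqs[[sample]]
  # unlist() collapses the length-1 AAStringSet into a single AAString
  genes[[i]] <- unlist(AAStringSet(seq[qseq_id]))
}
# Combine the list of AAString objects into one AAStringSet
genes_object <- AAStringSet(genes)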
genes_object
#> AAStringSet object of length 2:
#> width seq
#> [1] 117 MISIIKRFLGKRQPRQSAEHHYEFLPAHLALAQK...MSVVAEIKTDKRRVIDYLLSPIREYQAEALRER
#> [2] 117 MTAITISDQEYRDFSRFLESQCGIVLGDSKQYLV...GGYLLLGASESLTGLTDRFEMVRCNPGIIYKLK
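Created on 2022-08-09 by the reprex package (v2.0.1)
Unfortunately, a side effect is that you lose the @ranges@NAMES slot (i.e. the "names" column disappears).
I think you can add the names back to the new object, but you need to check that this works for your use case (i.e. check that the result is correct):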
library(Biostrings)
a <- AAString("MISIIKRFLGKRQPRQSAEHHYEFLPAHLALAQKPPSPFARLTAITLSIGVLAVLLWAYVFPAQVQLNKNNIVIDGQTVELTPGMSVVAEIKTDKRRVIDYLLSPIREYQAEALRER")
b <- AAString("MTAITISDQEYRDFSRFLESQCGIVLGDSKQYLVRSRLSPLVTKFKLASLSDLLRDVVTRNVLIYFSPDMKSKVLNQMANSLNPGGYLLLGASESLTGLTDRFEMVRCNPGIIYKLK")
myseqs <- list("Sample-01" = AAStringSet(c("IMEHDJCA_02683" = a)),
"Sample-02" = AAStringSet(c("DIBHEKPI_01114" = b)))
class(myseqs)
#> [1] "list"
names(myseqs)
#> [1] "Sample-01" "Sample-02"
myseqs[["Sample-01"]]
#> AAStringSet object of length 1:
#> width seq names
#> [1] 117 MISIIKRFLGKRQPRQSAEHHYE...KRRVIDYLLSPIREYQAEALRER IMEHDJCA_02683
class(myseqs[["Sample-01"]])
#> [1] "AAStringSet"
#> attr(,"package")
#> [1] "Biostrings"
df <- read.table(text = " qseqid samples gene
1 IMEHDJCA_02683 Sample-01 pilB
2 DIBHEKPI_01114 Sample-02 pilB", header = TRUE)
df
#> qseqid samples gene
#> 1 IMEHDJCA_02683 Sample-01 pilB
#> 2 DIBHEKPI_01114 Sample-02 pilB
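genes <- list()
qseqids <- character(0)
for (i in seq_len(nrow(df))) {
  sample <- df[i, 2]    # sample name, e.g. "Sample-01"
  qseq_id <- df[i, 1]   # sequence ID to pull from that sample
  seq <- myseqs[[sample]]
  # unlist() collapses the length-1 AAStringSet into a single AAString
  genes[[i]] <- unlist(AAStringSet(seq[qseq_id]))
  qseqids <- c(qseqids, qseq_id)  # remember the ID so it can be restored as a name
}
# Naming the list elements carries the IDs over into the AAStringSet
names(genes) <- qseqids
genes_object <- AAStringSet(genes)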
genes_object
#> AAStringSet object of length 2:
#> width seq names
#> [1] 117 MISIIKRFLGKRQPRQSAEHHYE...KRRVIDYLLSPIREYQAEALRER IMEHDJCA_02683
#> [2] 117 MTAITISDQEYRDFSRFLESQCG...SLTGLTDRFEMVRCNPGIIYKLK DIBHEKPI_01114
Created on 2022-08-09 by the reprex package (v2.0.1)
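As a possible alternative, the manual `c(S01, S02, ...)` step from the question can be generalised with `do.call(c, ...)`, which should keep the sequence names without any extra bookkeeping. The sketch below is not part of the original answer; it uses shortened toy sequences and a hypothetical variable `picked`, and it assumes every `qseqid` actually exists in the corresponding sample (subsetting by a missing name raises an error).

```r
library(Biostrings)

# Toy data mirroring the structure in the question (sequences shortened for illustration)
myseqs <- list(
  "Sample-01" = AAStringSet(c(IMEHDJCA_02683 = "MISIIKRFLGKRQPRQSAEHHYEF")),
  "Sample-02" = AAStringSet(c(DIBHEKPI_01114 = "MTAITISDQEYRDFSRFLESQCGI"))
)
df <- data.frame(
  qseqid  = c("IMEHDJCA_02683", "DIBHEKPI_01114"),
  samples = c("Sample-01", "Sample-02"),
  gene    = "pilB",
  stringsAsFactors = FALSE
)

# One length-1 AAStringSet per row of df; `picked` is an unnamed list
picked <- lapply(seq_len(nrow(df)), function(i) {
  myseqs[[df$samples[i]]][df$qseqid[i]]
})

# Concatenate all pieces, equivalent to c(picked[[1]], picked[[2]], ...)
genes_object <- do.call(c, picked)
genes_object
```

The list passed to `do.call()` is deliberately unnamed: if it carried names (for example the sample names), they would be passed to `c()` as argument names, which can interfere with method dispatch. Wrapping the list in `unname()` is a simple safeguard if it was built another way.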