How to extract genes from a GenBank file in R
I am asking because I don't really know how to do this.
I have a genome in GenBank format (YJ016_I.gb). I want to import it into R and then either export all of its genes as nucleotide sequences, or pull out a single sequence by gene name.
library(genbankr)
library(stringr)
library(purrr)
gb <- genbankr::readGenBank("YJ016_I.gb")
GENES <- GenomicFeatures::genes(gb)
GenesDF <- data.frame(GENES)
  seqnames start  end width strand type gene     locus_tag old_locus_tag loctype pseudo gene_synonym       gene_id
1        I   353  787   435      - gene mioC BJE04_RS00275  VV0001, ....  normal  FALSE           NA BJE04_RS00275
2        I   835 2196  1362      - gene mnmE BJE04_RS00280  VV0002, ....  normal  FALSE           NA BJE04_RS00280
3        I  2314 3933  1620      - gene yidC BJE04_RS00285  VV0003, ....  normal  FALSE           NA BJE04_RS00285
4        I  3936 4193   258      - gene yidD BJE04_RS23545                normal  FALSE           NA BJE04_RS23545
5        I  4160 4516   357      - gene rnpA BJE04_RS00290  VV0004, ....  normal  FALSE           NA BJE04_RS00290
6        I  4530 4670   141      - gene rpmH BJE04_RS00295  VV0005, ....  normal  FALSE           NA BJE04_RS00295
What I want is to extract the sequence for a given locus_tag (for example BJE04_RS00275), or alternatively to export every gene in the GenBank file as FASTA, like this:
>BJE04_RS00275
aatgc
>BJE04_RS00280
ggcta
>BJE04_5000
atggcaa
How can I do this in R? Solutions in any other language or program are welcome too.
Thanks
I do not have access to your specific file, so I am using the example file shipped with the genbankr package. You will need Biostrings to write the sequences to a FASTA file. To write only the sequence(s) for a particular locus_tag, subset GENES to that entry before calling getSeq (e.g. subset(GENES, locus_tag == "BJE04_RS00275") in your example).
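library(genbankr)
library(GenomicFeatures)
library(Biostrings)
# Read the example GenBank file shipped with genbankr
gb <- readGenBank(system.file("sample.gbk", package="genbankr"))
# Gene annotations as a GRanges object
GENES <- genes(gb)
# Name each range by its gene_id, then pull the matching sequences
res <- getSeq(gb@sequence, setNames(GENES, GENES$gene_id))
# Write all gene sequences to a FASTA file
writeXStringSet(res, "YJ016_I_genes.fa")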
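For the single-gene case, here is a minimal sketch of looking up one locus_tag and writing just that sequence. It uses a made-up toy genome and toy gene table so that it runs without any GenBank file. The helper extract_by_locus_tag and the TOY_ tags are hypothetical names invented for illustration; with real data, genome would presumably be gb@sequence and genes would be genes(gb). The sketch reverse-complements minus-strand genes explicitly, which matters here because all the genes in your table are on the minus strand.

```r
library(Biostrings)
library(GenomicRanges)

# Toy stand-ins (made up for illustration): with real data, `genome` would be
# gb@sequence and `genes` would be genes(gb).
genome <- DNAStringSet(c(I = "ATGAAACCCGGGTTTTAA"))
genes <- GRanges(
  seqnames = "I",
  ranges = IRanges(start = c(1, 10), end = c(6, 15)),
  strand = c("+", "-"),
  locus_tag = c("TOY_0001", "TOY_0002")
)

# Hypothetical helper: return one gene's sequence as a named DNAStringSet.
extract_by_locus_tag <- function(genome, genes, tag) {
  hit <- genes[which(genes$locus_tag == tag)]
  if (length(hit) != 1) stop("expected exactly one gene with locus_tag ", tag)
  # Pick the sequence the gene lies on, then cut out the gene's range
  chrom <- genome[[as.character(seqnames(hit))]]
  s <- subseq(chrom, start = start(hit), end = end(hit))
  # Genes on the minus strand are reverse-complemented
  if (as.character(strand(hit)) == "-") s <- reverseComplement(s)
  out <- DNAStringSet(s)
  names(out) <- tag  # the name becomes the FASTA header
  out
}

one_gene <- extract_by_locus_tag(genome, genes, "TOY_0002")
out_file <- tempfile(fileext = ".fa")
writeXStringSet(one_gene, out_file)
```

In this toy case TOY_0002 covers positions 10 to 15 (GGGTTT) on the minus strand, so the file contains the header >TOY_0002 followed by AAACCC.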