How to create a FASTA file with multiple sequences in it within R

I have been retrieving FASTA sequences from an online database (UniProt) by their accession numbers, using the following library:

    install.packages("protr")
    
    library("protr")


    IDs <- c("xxxx", "AAAAA")

    Proteins_IDs <- getUniProt(IDs)

    # Test for this
    Proteins_IDs

This works perfectly to fetch the sequences of interest in FASTA format, which I can then write to disk. The problem I have is with writing the multiple sequences into ONE merged FASTA file. So far I have only found a way to write a separate FASTA file for each sequence, using the code below:

    # write.fasta() comes from the seqinr package
    for (i in seq_along(Proteins_IDs)) {
      # Name each record and its file after the accession number
      write.fasta(Proteins_IDs[i], names = IDs[i], file.out = paste0(IDs[i], ".fasta"))
    }

The problem is that this creates an individual FASTA file for each sequence rather than one larger combined file containing all of them.

Use the functions of the Biostrings package from Bioconductor when dealing with FASTA files, and in general with any kind of "biological" (DNA, RNA, AA) strings, in R:

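    library(protr)
    IDs <- c("P00750", "P00751", "P00752")
    Proteins_IDs <- getUniProt(IDs)
    # Name each sequence after its accession number; the names become the FASTA headers
    names(Proteins_IDs) <- IDs

    library(Biostrings)
    # Collect all sequences into one amino-acid string set
    multifasta <- Biostrings::AAStringSet(unlist(Proteins_IDs))

    # Write the whole set to a single multi-FASTA file
    Biostrings::writeXStringSet(multifasta, "your_multifasta.fa")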

Output in the file:

>P00750
MDAMKRGLCCVLLLCGAVFVSPSQEIHARFRRGARSYQVICRDEKTQMIYQQHQSWLRPVLRSNRVEYCWCNSGRAQCHS
VPVKSCSEPRCFNGGTCQQALYFSDFVCQCPEGFAGKCCEIDTRATCYEDQGISYRGTWSTAESGAECTNWNSSALAQKP
YSGRRPDAIRLGLGNHNYCRNPDRDSKPWCYVFKAGKYSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWN
SMILIGKVYTAQNPSAQALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCSTCGLRQYSQPQFRIKGGLFADIA
SHPWQAAIFAKHRRSPGERFLCGGILISSCWILSAAHCFQERFPPHHLTVILGRTYRVVPGEEEQKFEVEKYIVHKEFDD
DTYDNDIALLQLKSDSSRCAQESSVVRTVCLPPADLQLPDWTECELSGYGKHEALSPFYSERLKEAHVRLYPSSRCTSQH
LLNRTVTDNMLCAGDTRSGGPQANLHDACQGDSGGPLVCLNDGRMTLVGIISWGLGCGQKDVPGVYTKVTNYLDWIRDNM
RP
>P00751
MGSNLSPQLCLMPFILGLLSGGVTTTPWSLARPQGSCSLEGVEIKGGSFRLLQEGQALEYVCPSGFYPYPVQTRTCRSTG
SWSTLKTQDQKTVRKAECRAIHCPRPHDFENGEYWPRSPYYNVSDEISFHCYDGYTLRGSANRTCQVNGRWSGQTAICDN
GAGYCSNPGIPIGTRKVGSQYRLEDSVTYHCSRGLTLRGSQRRTCQEGGSWSGTEPSCQDSFMYDTPQEVAEAFLSSLTE
TIEGVDAEDGHGPGEQQKRKIVLDPSGSMNIYLVLDGSDSIGASNFTGAKKCLVNLIEKVASYGVKPRYGLVTYATYPKI
WVKVSEADSSNADWVTKQLNEINYEDHKLKSGTNTKKALQAVYSMMSWPDDVPPEGWNRTRHVIILMTDGLHNMGGDPIT
VIDEIRDLLYIGKDRKNPREDYLDVYVFGVGPLVNQVNINALASKKDNEQHVFKVKDMENLEDVFYQMIDESQSLSLCGM
VWEHRKGTDYHKQPWQAKISVIRPSKGHESCMGAVVSEYFVLTAAHCFTVDDKEHSIKVSVGGEKRDLEIEVVLFHPNYN
INGKKEAGIPEFYDYDVALIKLKNKLKYGQTIRPICLPCTEGTTRALRLPPTTTCQQQKEELLPAQDIKALFVSEEEKKL
TRKEVYIKNGDKKGSCERDAQYAPGYDKVKDISEVVTPRFLCTGGVSPYADPNTCRGDSGGPLIVHKRSRFIQVGVISWG
VVDVCKNQKRQKQVPAHARDFHINLFQVLPWLKEKLQDEDLGFL
>P00752
APPIQSRIIGGRECEKNSHPWQVAIYHYSSFQCGGVLVNPKWVLTAAHCKNDNYEVWLGRHNLFENENTAQFFGVTADFP
HPGFNLSLLKXHTKADGKDYSHDLMLLRLQSPAKITDAVKVLELPTQEPELGSTCEASGWGSIEPGPDBFEFPDEIQCVQ
LTLLQNTFCABAHPBKVTESMLCAGYLPGGKDTCMGDSGGPLICNGMWQGITSWGHTPCGSANKPSIYTKLIFYLDWIND
TITENP
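
If installing Bioconductor is not an option, the same idea (one header line per sequence followed by the wrapped sequence, all in a single file) can be sketched in base R. This is only an illustration: `write_multifasta` is a hypothetical helper, not a function from protr or Biostrings, and the toy sequences stand in for what `getUniProt()` would return. The default width of 80 is chosen to match the line length seen in the output above.

```r
# Hypothetical helper (not part of protr or Biostrings): write a named list
# or character vector of sequences to a single multi-FASTA file.
write_multifasta <- function(seqs, path, width = 80) {
  seqs <- unlist(seqs)
  stopifnot(is.character(seqs), !is.null(names(seqs)), all(nchar(seqs) > 0))
  con <- file(path, open = "w")
  on.exit(close(con))
  for (i in seq_along(seqs)) {
    # Header line: ">" followed by the sequence name
    writeLines(paste0(">", names(seqs)[i]), con)
    # Split the sequence into chunks of at most `width` characters
    n <- nchar(seqs[i])
    starts <- seq(1, n, by = width)
    writeLines(substring(seqs[i], starts, pmin(starts + width - 1, n)), con)
  }
  invisible(path)
}

# Toy sequences (assumed data) in place of a real getUniProt() result
toy <- list(seqA = "MDAMKRGLCC", seqB = "MGSNLSPQLCLMPF")
out <- tempfile(fileext = ".fa")
write_multifasta(toy, out, width = 10)
readLines(out)
```

With real data, naming the list first (`names(Proteins_IDs) <- IDs`) and then calling `write_multifasta(Proteins_IDs, "your_multifasta.fa")` should produce one combined file, though the Biostrings route remains the better-tested choice.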
