Concatenating Multiple .fasta Files

I'm trying to concatenate hundreds of .fasta files into a single large FASTA file containing all of the sequences. I haven't found a specific method for this in the forums. I did come across this code from http://zientzilaria.heroku.com/blog/2007/10/29/merging-single-or-multiple-sequence-fasta-files , which I have adapted a bit.

Fasta.py contains the following code:

class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
           if index >= 1:
               items.append(aninstance)
           index+=1
           name = line[:-1]
           seq = ''
           aninstance = fasta(name, seq)
        else:
           seq += line[:-1]
           aninstance = fasta(name, seq)

    items.append(aninstance)
    return items

And here is the adapted script to concatenate .fasta files:

import sys
import glob
import fasta

#obtain directory containing single fasta files for query
filepattern = input('Filename pattern to match: ')

#obtain output directory
outfile = input('Filename of output file: ')

#create new output file
output = open(outfile, 'w')

#initialize lists
names = []
seqs = []

#glob.glob returns a list of files that match the pattern
for file in glob.glob(filepattern):

    print ("file: " + file)

    #we read the contents and an instance of the class is returned
    contents = fasta.read_fasta(open(file).readlines())

    #a file can contain more than one sequence so we read them in a loop
    for item in contents:
        names.append(item.name)
        seqs.append(item.sequence)

#we print the output
for i in range(len(names)):
    output.write(names[i] + '\n' + seqs[i] + '\n\n')

output.close()
print("done")

It is able to read the fasta files, but the newly created output file contains no sequences. The error I receive is due to fasta.py, which is beyond my capability to mess with:

Traceback (most recent call last):
  File "C:\Python32\myfiles\test\3\Fasta_Concatenate.py", line 28, in <module>
    contents = fasta.read_fasta(open(file).readlines())
  File "C:\Python32\lib\fasta.py", line 18, in read_fasta
    seq += line[:-1]
UnboundLocalError: local variable 'seq' referenced before assignment

Any suggestions? Thanks!

I think using Python for this job is overkill. On the command line, a quick way to concatenate any number of FASTA files with the .fasta or .fa extension is simply:

cat *.fa* > newfile.txt

The problem is in fasta.py:

else:
       seq += line[:-1]
       aninstance = fasta(name, seq)

Try initializing seq at the start of read_fasta(file).

EDIT: Further explanation

When you first call read_fasta, the first line in the file does not start with >, so the else branch runs and tries to append that line to seq, which has not been assigned yet: there is no string to append to. The error in the traceback explains the problem:

UnboundLocalError: local variable 'seq' referenced before assignment
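A minimal sketch of a reader that avoids this error, assuming that any text before the first > header (blank lines, stray characters) can safely be skipped. The names FastaRecord and chunks are illustrative and do not come from the original fasta.py:

```python
class FastaRecord:
    """One FASTA entry: the header line (including '>') and its sequence."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence


def read_fasta(lines):
    """Parse an iterable of lines into a list of FastaRecord objects."""
    items = []
    name = None   # no header seen yet
    chunks = []   # sequence lines collected for the current record
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                items.append(FastaRecord(name, "".join(chunks)))
            name = line
            chunks = []
        elif name is not None:
            chunks.append(line)
        # Assumption: text before the first header is silently skipped.
    if name is not None:
        items.append(FastaRecord(name, "".join(chunks)))
    return items
```

Using strip() instead of line[:-1] also avoids cutting off the last character when the final line of a file has no trailing newline.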

Not a Python programmer, but it seems that the question's code tries to condense the data for each sequence into a single line and also to separate sequences with a blank line.

  >seq1
  00000000
  11111111
  >seq2
  22222222
  33333333

would become

  >seq1
  0000000011111111

  >seq2
  2222222233333333

If this is in fact needed, the cat-based solution above would not work. Otherwise, cat is the simplest and most effective solution.
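If the one-line-per-sequence layout is wanted, it can be sketched in Python roughly as follows. The function name merge_fasta and its parameters are hypothetical, and this version leaves out the blank separator line between records, which is a choice of the sketch rather than something the question requires:

```python
def merge_fasta(paths, out_path):
    """Write all records from paths to out_path, one line per sequence."""
    with open(out_path, "w") as out:
        in_sequence = False  # True while the current sequence line is still open
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue  # drop blank lines
                    if line.startswith(">"):
                        if in_sequence:
                            out.write("\n")  # close the previous sequence
                        out.write(line + "\n")
                        in_sequence = False
                    else:
                        out.write(line)  # no newline: joins wrapped lines
                        in_sequence = True
        if in_sequence:
            out.write("\n")
```

It could be called as merge_fasta(sorted(glob.glob("*.fasta")), "merged.fa") after import glob. Giving the output file a name that does not match the input pattern keeps it from being read as one of its own inputs.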

For Windows, via the command prompt (note: the folder should contain only the required files):

copy *.fasta final.fasta

Enjoy.

The following ensures that the contents of each file always start on a new line:

$ awk 1 *.fasta > largefile.fasta

The solution using cat might fail on that:

$ echo -n foo > f1
$ echo bar > f2
$ cat f1 f2
foobar
$ awk 1 f1 f2
foo
bar
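For anyone who does want to stay in Python, the same newline guarantee can be sketched as below. The function name concat_with_newlines is made up for illustration, and the sketch reads each file fully into memory, which assumes the individual files are reasonably small:

```python
def concat_with_newlines(paths, out_path):
    """Concatenate files, adding a newline after any file that lacks one."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as handle:
                data = handle.read()
            out.write(data)
            # Like `awk 1`, make sure the next file starts on its own line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
```

Working in binary mode copies the bytes unchanged, so the only difference from plain concatenation is the added newline.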
