Concatenating multiple .fasta files

Question

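I am trying to concatenate hundreds of .fasta files into a single large FASTA file containing all of the sequences. I have not found a specific way to do this in the forums. I did come across this code at http://zientzilaria.heroku.com/blog/2007/10/29/merging-single-or-multiple-sequence-fasta-files, which I have adapted a bit.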

Fasta.py contains the following code:

class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
           if index >= 1:
               items.append(aninstance)
           index+=1
           name = line[:-1]
           seq = ''
           aninstance = fasta(name, seq)
        else:
           seq += line[:-1]
           aninstance = fasta(name, seq)

    items.append(aninstance)
    return items

Here is the adapted script that concatenates the .fasta files:

import sys
import glob
import fasta

#obtain directory containing single fasta files for query
filepattern = input('Filename pattern to match: ')

#obtain output directory
outfile = input('Filename of output file: ')

#create new output file
output = open(outfile, 'w')

#initialize lists
names = []
seqs = []

#glob.glob returns a list of files that match the pattern
for file in glob.glob(filepattern):

    print ("file: " + file)

    #we read the contents and an instance of the class is returned
    contents = fasta.read_fasta(open(file).readlines())

    #a file can contain more than one sequence so we read them in a loop
    for item in contents:
        names.append(item.name)
        seqs.append(item.sequence)

#we print the output
for i in range(len(names)):
    output.write(names[i] + '\n' + seqs[i] + '\n\n')

output.close()
print("done")

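It is able to read the FASTA files, but the newly created output file contains no sequences. The error I get comes from fasta.py, which is beyond my ability to fix: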

Traceback (most recent call last):
  File "C:\Python32\myfiles\test\3\Fasta_Concatenate.py", line 28, in <module>
    contents = fasta.read_fasta(open(file).readlines())
  File "C:\Python32\lib\fasta.py", line 18, in read_fasta
    seq += line[:-1]
UnboundLocalError: local variable 'seq' referenced before assignment

Any suggestions? Thanks!

Answer 1

I think using Python for this job is overkill. On the command line, a quick way to concatenate one or more FASTA files with a .fasta or .fa extension is:

cat *.fa* > newfile.txt

Answer 2

The problem is in fasta.py:

else:
       seq += line[:-1]
       aninstance = fasta(name, seq)

Try initializing seq at the start of read_fasta(file), before the loop runs.

EDIT: further explanation

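When read_fasta is first called, the first line of the file does not start with >, so the else branch tries to append that line to seq, a string that has not been initialized (or even declared) yet. In other words, you are appending a string (the first line) to a name that has no value. The error at the bottom of the traceback says exactly that: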

UnboundLocalError: local variable 'seq' referenced before assignment
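
A minimal sketch of that fix, keeping the class and function names from the question. It initializes name and seq before the loop and only stores a record once a header has been seen. Skipping any lines that appear before the first header is an assumption made here, not something the original code does.

```python
class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence


def read_fasta(lines):
    """Parse an iterable of FASTA lines into a list of fasta records."""
    items = []
    name = None  # no header seen yet
    seq = ''     # initialized up front, so += never hits an unbound name
    for line in lines:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if name is not None:
                items.append(fasta(name, seq))
            name = line
            seq = ''
        elif name is not None:
            # assumption: lines before the first header are ignored
            seq += line
    if name is not None:
        items.append(fasta(name, seq))
    return items
```

With this version, a file that starts with a blank line (or any other non-header line) no longer raises UnboundLocalError, and an empty file simply yields an empty list.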

Answer 3

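Not a Python programmer, but it looks like the code in the question tries to put each sequence's data on a single line and also separates the sequences with a blank line.

So sequence data that is wrapped across several lines would become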

  >seq1
  0000000011111111

  >seq2
  2222222233333333

If that is actually what is needed, the cat-based solution above will not work. Otherwise, cat is the simplest and most efficient solution.
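
If the flattened layout is what is wanted, a rough Python sketch could look like the following. The function names iter_records and merge_flattened are hypothetical, and the output format (header, one sequence line, one blank line) is inferred from the script in the question rather than from any FASTA requirement.

```python
import glob
import os


def iter_records(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(parts)
            header, parts = line, []
        elif header is not None and line:
            parts.append(line)
    if header is not None:
        yield header, ''.join(parts)


def merge_flattened(pattern, out_path):
    """Write all records from files matching pattern, one sequence line each."""
    target = os.path.abspath(out_path)
    # collect inputs first and skip the output file if the pattern matches it
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != target]
    with open(out_path, 'w') as out:
        for path in paths:
            with open(path) as handle:
                for header, seq in iter_records(handle):
                    out.write(header + '\n' + seq + '\n\n')
```

Because records are written as they are read, only one sequence is held in memory at a time, which matters when merging hundreds of files.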

Answer 4

On Windows, from the command prompt (note: the folder should contain only the required files):

copy *.fasta final.fasta

Enjoy.

Answer 5

The following ensures that the content of each new file always starts on a new line:

$ awk 1 *.fasta > largefile.fasta

Solutions that use cat can fail when a file does not end with a newline:

$ echo -n foo > f1
$ echo bar > f2
$ cat f1 f2
foobar
$ awk 1 f1 f2
foo
bar
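
For anyone who still wants to do this from Python, the same idea can be sketched as below: copy each file as-is and add a newline only when the file does not already end with one. The function name concat_with_newlines is hypothetical, and the sketch assumes each input file is small enough to read into memory in one go.

```python
import glob
import os


def concat_with_newlines(pattern, out_path):
    """Concatenate files matching pattern, ending each one with a newline."""
    target = os.path.abspath(out_path)
    # collect inputs first and skip the output file if the pattern matches it
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != target]
    with open(out_path, 'wb') as out:
        for path in paths:
            with open(path, 'rb') as handle:
                data = handle.read()
            out.write(data)
            # add the missing trailing newline so the next header starts a line
            if data and not data.endswith(b'\n'):
                out.write(b'\n')
    return paths
```

Files are opened in binary mode so that existing line endings are copied unchanged.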
