Rosalind Profile and Consensus: Writing long strings to one line in Python (formatting)

Question

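I'm trying to solve a problem on Rosalind where, given a FASTA file containing at most 10 sequences of up to 1 kb each, I need to report the consensus sequence and the profile (how many times each base occurs at each position across the sequences). As far as formatting my response goes, the code I have works for small sequences (verified).

However, I run into problems formatting my response when the sequences are large. Regardless of length, what I expect to get back is: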

"consensus sequence"
"A: one line string of numbers without commas"
"C: one line string """" "
"G: one line string """" "
"T: one line string """" "

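with all lines aligned with each other and each on its own line, or at least in some format that lets me carry this output forward as a unit and keep the alignment intact.

But when I run my code on larger sequences, each individual string below the consensus sequence comes out broken up by newlines, presumably because the string itself is too long. I have been struggling to think of a fix, and my searches have turned up nothing. I am considering some iterative writing algorithm that would write everything described above, but in chunks. Any help would be greatly appreciated. For completeness, I have attached my entire code below, with the necessary comments added throughout the body.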

def cons(file):
#returns consensus sequence and profile of a FASTA file
    import os
    path = os.path.abspath(os.path.expanduser(file))

    with open(path,"r") as D:
        F=D.readlines()

#initialize list of sequences, list of all strings, and a temporary storage
#list, respectively
    SEQS=[]
    mystrings=[]
    temp_seq=[]

#get a list of strings from the file, stripping the newline character
    for x in F:
        mystrings.append(x.strip("\n"))

#if the string in question is a nucleotide sequence (without ">")
#i'll store that string into a temporary variable until I run into a string
#with a ">", in which case I'll join all the strings in my temporary
#sequence list and append to my list of sequences SEQS    
    for i in range(1,len(mystrings)):
        if ">" not in mystrings[i]:
            temp_seq.append(mystrings[i])
        else:
            SEQS.append(("").join(temp_seq))
            temp_seq=[]
    SEQS.append(("").join(temp_seq))

#set up list of nucleotide counts for A,C,G and T, in that order
    ACGT=      [[0 for i in range(0,len(SEQS[0]))],
                [0 for i in range(0,len(SEQS[0]))],
                [0 for i in range(0,len(SEQS[0]))],
                [0 for i in range(0,len(SEQS[0]))]]

#assumed to be equal length sequences. Counting amount of shared nucleotides
#in each column
    for i in range(0,len(SEQS[0])-1):
        for j in range(0, len(SEQS)):
            if SEQS[j][i]=="A":
                ACGT[0][i]+=1
            elif SEQS[j][i]=="C":
                ACGT[1][i]+=1
            elif SEQS[j][i]=="G":
                ACGT[2][i]+=1
            elif SEQS[j][i]=="T":
                ACGT[3][i]+=1

    ancstr=""
    TR_ACGT=list(zip(*ACGT))
    acgt=["A: ","C: ","G: ","T: "]
    for i in range(0,len(TR_ACGT)-1):
        comp=TR_ACGT[i]
        if comp.index(max(comp))==0:
            ancstr+=("A")
        elif comp.index(max(comp))==1:
            ancstr+=("C")
        elif comp.index(max(comp))==2:
            ancstr+=("G")
        elif comp.index(max(comp))==3:
            ancstr+=("T")

'''
writing to file... trying to get it to write as
consensus sequence
A: blah(1line)
C: blah(1line)
G: blah(1line)
T: blah(line)
which works for small sequences. but for larger sequences
python keeps adding newlines if the string in question is very long...
'''


    myfile="myconsensus.txt"
    writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in      range(0,len(ACGT))) for i in range(0,len(acgt))]
    with open(myfile,'w') as D:
        D.writelines(ancstr)
        D.writelines("\n")
        for i in range(0,len(writing_strings)):
            D.writelines(writing_strings[i])
            D.writelines("\n")

cons("rosalind_cons.txt")

Answer 1

Your code is totally fine except for this line:

writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in      range(0,len(ACGT))) for i in range(0,len(acgt))]

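You accidentally duplicated the data: the extra "for i in range(0,len(ACGT))" clause inside the join repeats every count four times, so each line comes out four times longer than intended. Try replacing it with:

writing_strings=[acgt[i] + str(ACGT[i])[1:-1].replace(",", "") for i in range(0,len(ACGT))]

Then write it to the output file like this:

D.write(writing_strings[i])

Slicing with [1:-1] is a lazy way to get rid of the list's brackets, and the replace call drops the commas.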
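
To make the duplication concrete, and to show one way of building the output lines without it, here is a minimal sketch. The helper name format_profile and the tiny three-column profile are hypothetical illustrations, not part of the original code; only the names ACGT and the "A: " style labels come from the question.

```python
def format_profile(consensus, counts, labels="ACGT"):
    """Return the consensus line followed by one count line per label.

    `counts` is assumed to be a list of four rows (A, C, G, T), each
    holding one integer per column, as in the question's ACGT list.
    """
    lines = [consensus]
    for label, row in zip(labels, counts):
        # One space-separated line per nucleotide, no brackets or commas.
        lines.append(label + ": " + " ".join(str(n) for n in row))
    return "\n".join(lines) + "\n"


# Hypothetical profile: 4 rows (A, C, G, T) by 3 columns.
ACGT = [[2, 0, 0], [0, 1, 0], [0, 1, 2], [0, 0, 0]]

# The generator from the question: the trailing "for i in ..." clause
# yields every count len(ACGT) times in a row.
duplicated = " ".join(str(n) for n in ACGT[0] for i in range(0, len(ACGT)))
print(duplicated)

# The same row built without the extra clause, via the helper.
text = format_profile("ACG", ACGT)
print(text, end="")
```

With the question's variables, the whole result could then be written in one call, for example D.write(format_profile(ancstr, ACGT)) inside the existing with block. Separately, the counting loops in the question use range(0,len(SEQS[0])-1) and range(0,len(TR_ACGT)-1), which appear to stop one column short; dropping the -1 would cover the final position.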

Solution 1
0 Accepted 2016-08-06 16:45:25
