Reading sequences from a file as strings rather than individual letters (Python)

Question

I have three files that look like this:

>xx_oneFish |xxx
AAAAAAA
>xx_twoFish |xxx
CCCCCC
>xx_redFish |xxx
TTTTTT
>xx_blueFish |xxx
GGGGGG

>xx_oneFish |xxx
aaaa
>xx_twoFish |xxx
cccc

>xx_redFish |xxx
tt
>xx_blueFish |xxx
gg

I am trying to read these files using Python to get this result:

[[ 'aaaa', 'cccc'], ['tt', 'gg'], [ 'AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]

Here is my code:

testNames = []
testSequences = []
counter = 0
for filename in os.listdir("/PATH/TO/FILE"): #go to directory where aligned files are kept
    if filename.endswith(".txt"): #open files which have been aligned with MAFFT
        fastaFile = open(filename, 'r') 
        testNames.append([])
        testSequences.append([])
        for line in fastaFile: 
            line = line.strip() 
            if len(line)>0: 
                if line[0] == '>':  
                    testNames[counter].append(line[1:]) 
                    testSequences.append("") 
                    currentTaxon = len(testSequences)-1 
                else: 
                    testSequences[currentTaxon] += line 
        counter +=1

print testSequences

This gives me this result:

[[], 'aaaa', 'cccc', [], 'tt', 'gg', [], 'AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']

I tried to change my code to put the strings inside the brackets by taking out the 14th line:

testNames = []
testSequences = []
counter = 0
for filename in os.listdir("/PATH/TO/FILE"): #go to directory where aligned files are kept
    if filename.endswith(".txt"): #open files which have been aligned with MAFFT
        fastaFile = open(filename, 'r') 
        testNames.append([])
        testSequences.append([])
        for line in fastaFile: 
            line = line.strip() 
            if len(line)>0: 
                if line[0] == '>':  
                    testNames[counter].append(line[1:]) 
                    currentTaxon = len(testSequences)-1 
                else: 
                    testSequences[currentTaxon] += line 
        counter +=1

print testSequences

Now I get this result:

[['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c'], ['t', 't', 'g', 'g'], ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C', 'C', 'C', 'T', 'T', 'T', 'T', 'T', 'T', 'G', 'G', 'G', 'G', 'G', 'G']]

How can I fix my code so that the sequences are read in as strings inside the nested list?

I want to keep the contents of the list testNames as is:

[['xx_oneFish |xxx', 'xx_twoFish |xxx'], ['xx_redFish |xxx', 'xx_blueFish |xxx'], ['xx_oneFish |xxx', 'xx_twoFish |xxx', 'xx_redFish |xxx', 'xx_blueFish |xxx']]

Answer 1

Try this:

import os

testSequences = []
testNames = []
for filename in os.listdir("./"):  # list the files in the current directory
    if filename.endswith(".txt"):  # only the files which have been aligned with MAFFT
        temp_sub_list_names = []  # names found in this file
        temp_sub_list_seq = []    # sequences found in this file
        with open(filename, 'r') as fastaFile:  # the file is closed automatically
            for line in fastaFile:
                line = line.strip()
                if line:  # skip blank lines
                    if not line.startswith('>'):
                        temp_sub_list_seq.append(line)    # whole line as one string
                    else:
                        temp_sub_list_names.append(line)  # header line, '>' included
        testSequences.append(temp_sub_list_seq)
        testNames.append(temp_sub_list_names)

print(testSequences)
print(testNames)

Output:

[['tt', 'gg'], ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG'], ['aaaa', 'cccc']]
[['>xx_redFish |xxx', '>xx_blueFish |xxx'], ['>xx_oneFish |xxx', '>xx_twoFish |xxx', '>xx_redFish |xxx', '>xx_blueFish |xxx'], ['>xx_oneFish |xxx', '>xx_twoFish |xxx']]
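
As for why the second attempt in the question produced single letters: without the `testSequences.append("")` line, `testSequences[currentTaxon]` is a list, and `+=` on a list with a string on the right-hand side extends the list with the string's characters. A minimal illustration (the variable names are made up for this sketch):

```python
# += on a list iterates over the right-hand side,
# so a string is split into its characters.
seqs_extend = []
seqs_extend += "aaaa"

# append() adds the whole string as a single element.
seqs_append = []
seqs_append.append("aaaa")

print(seqs_extend)  # ['a', 'a', 'a', 'a']
print(seqs_append)  # ['aaaa']
```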

Note:

1. This works if the script is in the same folder as the text files.
2. This doesn't check that each line starting with '>' is followed by exactly one sequence line.

For example, if one of your .txt files looks like this:

>xx_oneFish |xxx
aaaa
bbbb
dddd
>xx_twoFish |xxx
cccc

For that file, the sub-list produced inside testSequences would be ['aaaa', 'bbbb', 'dddd', 'cccc'], with each line kept as a separate string.
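
If sequences may wrap over several lines, one option is to start a new string at each '>' header and concatenate the following lines onto it. The sketch below assumes that is the behaviour you want. `read_fasta` and `read_fasta_dir` are hypothetical helper names, not part of any library. It also drops the leading '>' from the names, as asked in the question, and joins the directory onto each filename so the script does not have to sit next to the files.

```python
import os


def read_fasta(path):
    """Return (names, sequences) for one FASTA-style file (hypothetical helper)."""
    names, sequences = [], []
    with open(path) as fasta_file:
        for line in fasta_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                names.append(line[1:])  # drop the leading '>'
                sequences.append("")    # start a new record
            elif sequences:
                # str += str concatenates, so wrapped lines are joined.
                # Sequence lines before the first header are ignored.
                sequences[-1] += line
    return names, sequences


def read_fasta_dir(directory, suffix=".txt"):
    """Parse every matching file in `directory`, giving one sub-list per file."""
    all_names, all_sequences = [], []
    # os.listdir returns names in arbitrary order; sort for a stable result.
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(suffix):
            names, sequences = read_fasta(os.path.join(directory, filename))
            all_names.append(names)
            all_sequences.append(sequences)
    return all_names, all_sequences
```

With the example file above, `read_fasta` would return the names ['xx_oneFish |xxx', 'xx_twoFish |xxx'] and the sequences ['aaaabbbbdddd', 'cccc'].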
