Reading in sequences from a file as strings, not individual letters (Python)

I have three files that look like this:

>xx_oneFish |xxx
AAAAAAA
>xx_twoFish |xxx
CCCCCC
>xx_redFish |xxx
TTTTTT
>xx_blueFish |xxx
GGGGGG

>xx_oneFish |xxx
aaaa
>xx_twoFish |xxx
cccc

>xx_redFish |xxx
tt
>xx_blueFish |xxx
gg

I am trying to read these files with Python to get this result:

[[ 'aaaa', 'cccc'], ['tt', 'gg'], [ 'AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]

Here is my code:

testNames = []
testSequences = []
counter = 0
for filename in os.listdir("/PATH/TO/FILE"): #go to directory where aligned files are kept
    if filename.endswith(".txt"): #open files which have been aligned with MAFFT
        fastaFile = open(filename, 'r') 
        testNames.append([])
        testSequences.append([])
        for line in fastaFile: 
            line = line.strip() 
            if len(line)>0: 
                if line[0] == '>':  
                    testNames[counter].append(line[1:]) 
                    testSequences.append("") 
                    currentTaxon = len(testSequences)-1 
                else: 
                    testSequences[currentTaxon] += line 
        counter +=1

print testSequences

This gives me this result:

[[], 'aaaa', 'cccc', [], 'tt', 'gg', [], 'AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']

I tried to change my code to put the strings inside the brackets by removing the 14th line (testSequences.append("")):

testNames = []
testSequences = []
counter = 0
for filename in os.listdir("/PATH/TO/FILE"): #go to directory where aligned files are kept
    if filename.endswith(".txt"): #open files which have been aligned with MAFFT
        fastaFile = open(filename, 'r') 
        testNames.append([])
        testSequences.append([])
        for line in fastaFile: 
            line = line.strip() 
            if len(line)>0: 
                if line[0] == '>':  
                    testNames[counter].append(line[1:]) 
                    currentTaxon = len(testSequences)-1 
                else: 
                    testSequences[currentTaxon] += line 
        counter +=1

print testSequences

Now I get this result:

[['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c'], ['t', 't', 'g', 'g'], ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C', 'C', 'C', 'T', 'T', 'T', 'T', 'T', 'T', 'G', 'G', 'G', 'G', 'G', 'G']]

How can I fix my code so that the sequences are read in as strings inside the nested list?

I want to keep the contents of the list testNames as is:

[['xx_oneFish |xxx', 'xx_twoFish |xxx'], ['xx_redFish |xxx', 'xx_blueFish |xxx'], ['xx_oneFish |xxx', 'xx_twoFish |xxx', 'xx_redFish |xxx', 'xx_blueFish |xxx']]

Try this:

import os

testSequences = []
testNames = []
for filename in os.listdir("./"):  # directory where the aligned files are kept
    if filename.endswith(".txt"):  # files that have been aligned with MAFFT
        temp_sub_list_names = []
        temp_sub_list_seq = []
        # the with block closes the file once it has been read
        with open(filename, 'r') as fastaFile:
            for line in fastaFile:
                line = line.strip()
                if line:  # skip blank lines
                    if not line.startswith('>'):
                        temp_sub_list_seq.append(line)  # sequence line, kept as one string
                    else:
                        temp_sub_list_names.append(line)  # header line, '>' included
        testSequences.append(temp_sub_list_seq)
        testNames.append(temp_sub_list_names)

print(testSequences)
print(testNames)

Output:

[['tt', 'gg'], ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG'], ['aaaa', 'cccc']]
[['>xx_redFish |xxx', '>xx_blueFish |xxx'], ['>xx_oneFish |xxx', '>xx_twoFish |xxx', '>xx_redFish |xxx', '>xx_blueFish |xxx'], ['>xx_oneFish |xxx', '>xx_twoFish |xxx']]
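For context on why the second attempt in the question produced single letters: applying += to a list with a string on the right-hand side extends the list with each character, whereas append adds the whole string as one element. A minimal sketch (the variable names extended and appended are only illustrative):

```python
# += on a list iterates over the right-hand side, so a string
# is added one character at a time.
extended = [[]]
extended[0] += "aaaa"
print(extended)   # [['a', 'a', 'a', 'a']]

# append() adds the whole string as a single element.
appended = [[]]
appended[0].append("aaaa")
print(appended)   # [['aaaa']]
```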

Note:

1. This works only if the script is in the same folder as the text files.
2. This doesn't check that each line starting with '>' is followed by exactly one sequence line. That being said, if one of your .txt files looks like this:

>xx_oneFish |xxx
aaaa
bbbb
dddd
>xx_twoFish |xxx
cccc

For that file, the sub-list produced inside testSequences would be ['aaaa', 'bbbb', 'dddd', 'cccc'].
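If wrapped sequences should instead be joined into one string per header, and the names stored without the leading '>' as the question asks, something along these lines might work. This is only a sketch: read_fasta and read_directory are hypothetical helper names, and sorting the filenames is an added assumption to make the order of the sub-lists predictable (os.listdir returns entries in arbitrary order).

```python
import os

def read_fasta(path):
    """Return (names, sequences) for one FASTA-style file."""
    names, sequences = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue                 # skip blank lines
            if line.startswith(">"):
                names.append(line[1:])   # drop the leading '>'
                sequences.append("")     # start a new, empty sequence
            elif sequences:
                # str += str joins wrapped lines into one sequence;
                # lines that appear before any header are ignored
                sequences[-1] += line
    return names, sequences

def read_directory(directory, suffix=".txt"):
    """Collect one list of names and one list of sequences per file."""
    test_names, test_sequences = [], []
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(suffix):
            # join with the directory so the script can live anywhere
            names, sequences = read_fasta(os.path.join(directory, filename))
            test_names.append(names)
            test_sequences.append(sequences)
    return test_names, test_sequences
```

Calling testNames, testSequences = read_directory("/PATH/TO/FILE") would then give one sub-list per file, and the example file above would yield ['aaaabbbbdddd', 'cccc'] rather than four separate strings.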
