
How to continue appending into one list entry until a certain character?

I'm trying to append all the lines that come before the next '>' character into one list entry, so I can use it as a value in a dictionary. For example, I'm trying to make:

> 1
AAA
CCC
> 2

become AAACCC.

The code is below:

def parse_fasta(path):
    with open(path) as thefile:
        label = []
        sequences = []
        for k, line in enumerate(thefile):
            if line.startswith('>'):
                labeler = line.strip('>').strip('\n')
                label.append(labeler)
            else:
                seqfix = ''.join(line.strip('\n'))
                sequences.append(seqfix)
    dict_version = {k: v for k, v in zip(label, sequences)}
    return dict_version
parse_fasta('small.fasta')

You can create the dictionary as you go. Here is one way to do that.

EDIT: removed defaultdict (so no extra modules are needed for the parsing itself)

from pprint import pprint

dict_version = {}

with open('fasta_sample.txt', 'r') as f:
    for line in f:
        line = line.rstrip()

        if line.startswith('>'):
            key = line[1:]
        else:
            if key in dict_version:
                dict_version[key] += line
            else:
                dict_version[key] = line

pprint(dict_version)

The sample file:

>1FN3:A|PDBID|CHAIN|SEQUENCE
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

>5OKT:A|PDBID|CHAIN|SEQUENCE
MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQ
GGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGK
KGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAAT
KRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*

>2PAB:A|PDBID|CHAIN|SEQUENCE
GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWK
ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*

>3IDP:B|PDBID|CHAIN|SEQUENCE
HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRK
TRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHED
LTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIF
MVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS

>4QUD:A|PDBID|CHAIN|SEQUENCE
MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRN
LKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFII
QACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVN
RKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH

The pretty-printed dictionary is:

{'1FN3:A|PDBID|CHAIN|SEQUENCE': 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
 '2PAB:A|PDBID|CHAIN|SEQUENCE': 'GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*',
 '3IDP:B|PDBID|CHAIN|SEQUENCE': 'HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRKTRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHEDLTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIFMVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS',
 '4QUD:A|PDBID|CHAIN|SEQUENCE': 'MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH',
 '5OKT:A|PDBID|CHAIN|SEQUENCE': 'MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQGGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGKKGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAATKRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*'}
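
As a variation on the build-as-you-go idea above, here is a minimal sketch that collects the lines of each record in a list and joins them once at the end. The function name parse_fasta_lines and the in-memory sample are assumptions made for illustration; they are not part of the original answer. It also skips blank lines and ignores any sequence text that appears before the first header.

```python
def parse_fasta_lines(lines):
    """Map each header (without the leading '>') to its joined sequence.

    Hypothetical helper: accepts any iterable of lines, e.g. an open file.
    """
    chunks = {}   # header -> list of sequence lines seen so far
    key = None    # no header seen yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                    # skip blank lines
        if line.startswith('>'):
            key = line[1:].strip()      # drop '>' and surrounding spaces
            chunks.setdefault(key, [])
        elif key is not None:           # ignore text before the first header
            chunks[key].append(line)
    # Join the pieces of each record once, after all lines are read.
    return {k: ''.join(v) for k, v in chunks.items()}


sample = ['> 1\n', 'AAA\n', 'CCC\n', '> 2\n', 'GGG\n']
result = parse_fasta_lines(sample)
print(result)  # {'1': 'AAACCC', '2': 'GGG'}
```

To read from a file instead, pass the open file object, for example parse_fasta_lines(open(path)) inside a with block.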

EDIT: A solution that follows your attempt more closely:

from pprint import pprint

def parse_fasta(path):
    with open(path) as thefile:
        label = []
        sequences = ''
        total_seq = []

        for line in thefile:
            line = line.strip()
            if len(line) == 0:
                continue
            if line.startswith('>'):
                line = line.strip('>')
                label.append(line)
                if len(sequences) > 0:
                    total_seq.append(sequences)
                    sequences = ''
            else:
                sequences += line

        total_seq.append(sequences)

    dict_version = {k: v for k, v in zip(label, total_seq)}
    return dict_version

d = parse_fasta('fasta_sample.txt')

pprint(d)

You'll see I made some changes to get the correct output. I added a list, total_seq, to hold the complete sequence for each sequence header. (Your code did not have this, which was the problem in your solution.) The join in your code was not doing anything: it was applied to a single line, so the value stayed a single string, although you had the right idea. In the revised code, the lines that belong to one header ID are accumulated with += into one string of FASTA characters.
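
To make the point about join concrete, here is a small sketch that replays what the original loop does for the four-line example from the question. The variable names mirror the question's code; the literal values are worked out by hand for that example.

```python
# ''.join() on a single string just rebuilds the same string.
line = 'AAA\n'
seqfix = ''.join(line.strip('\n'))
same_as_stripped = (seqfix == line.strip('\n'))
print(same_as_stripped)  # True

# The original loop therefore stores one list entry per line.
# Note that strip('>') leaves the space after '>' in each label.
label = [' 1', ' 2']
sequences = ['AAA', 'CCC']
paired = dict(zip(label, sequences))
print(paired)  # {' 1': 'AAA', ' 2': 'CCC'}
```

The second header ends up paired with the second line of the first record, which is why a separate per-record accumulator is needed.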

I test for blank lines and continue to the next line if the line is blank (len(line) == 0).

The check if len(sequences) > 0 tests whether any sequence text has been accumulated yet. On the first record it has not, because the parser sees the ID before it has seen any sequence lines.

After the for loop completes, it is necessary to add the last sequence

total_seq.append(sequences)

since every sequence except the last is added to total_seq when a new ID is detected.
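
A short sketch of what goes wrong without that final append: zip stops at the shorter input, so the last header is silently dropped. The values here are made up for illustration.

```python
label = ['1', '2']
total_seq = ['AAACCC']   # final append omitted: nothing stored for '2'

without_last = dict(zip(label, total_seq))
print(without_last)  # {'1': 'AAACCC'}

total_seq.append('GGG')  # the append performed after the loop
with_last = dict(zip(label, total_seq))
print(with_last)  # {'1': 'AAACCC', '2': 'GGG'}
```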

I hope this explanation is helpful, since it follows your code more closely.
