How to keep appending lines into one list entry until a certain character?
I'm trying to append all the lines that come before the next '>' character into one list entry, so I can use it as a value in a dictionary. For example, I'm trying to make:
> 1
AAA
CCC
> 2
become AAACCC.
The code is below:
def parse_fasta(path):
    with open(path) as thefile:
        label = []
        sequences = []
        for k, line in enumerate(thefile):
            if line.startswith('>'):
                labeler = line.strip('>').strip('\n')
                label.append(labeler)
            else:
                seqfix = ''.join(line.strip('\n'))
                sequences.append(seqfix)
        dict_version = {k: v for k, v in zip(label, sequences)}
    return dict_version

parse_fasta('small.fasta')
You can create the dictionary as you go. Here is a method for doing that.
EDIT: removed defaultdict (so no modules are needed apart from pprint, which is only used for display)
from pprint import pprint

dict_version = {}
with open('fasta_sample.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            # header line: everything after '>' becomes the current key
            key = line[1:]
        else:
            # sequence line: append it to the entry for the current key
            if key in dict_version:
                dict_version[key] += line
            else:
                dict_version[key] = line
pprint(dict_version)
The sample file:
>1FN3:A|PDBID|CHAIN|SEQUENCE
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>5OKT:A|PDBID|CHAIN|SEQUENCE
MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQ
GGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGK
KGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAAT
KRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*
>2PAB:A|PDBID|CHAIN|SEQUENCE
GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWK
ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*
>3IDP:B|PDBID|CHAIN|SEQUENCE
HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRK
TRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHED
LTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIF
MVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS
>4QUD:A|PDBID|CHAIN|SEQUENCE
MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRN
LKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFII
QACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVN
RKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH
The pretty-printed dictionary is:
{'1FN3:A|PDBID|CHAIN|SEQUENCE': 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
'2PAB:A|PDBID|CHAIN|SEQUENCE': 'GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*',
'3IDP:B|PDBID|CHAIN|SEQUENCE': 'HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRKTRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHEDLTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIFMVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS',
'4QUD:A|PDBID|CHAIN|SEQUENCE': 'MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH',
'5OKT:A|PDBID|CHAIN|SEQUENCE': 'MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQGGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGKKGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAATKRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*'}
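The build-as-you-go approach above can also be wrapped in a function that collects each record's lines in a list and joins them once at the end. This is only a minimal sketch: the name parse_fasta_lines, the choice to accept any iterable of lines, and the decision to ignore sequence lines that appear before the first header are assumptions made for illustration, not part of the original answer.

```python
def parse_fasta_lines(lines):
    """Map each FASTA header (without '>') to its concatenated sequence."""
    chunks = {}   # header -> list of sequence lines seen for that header
    key = None    # no header has been seen yet
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            key = line[1:].strip()
            chunks.setdefault(key, [])
        elif key is not None:
            chunks[key].append(line)
        # assumption: sequence lines before the first header are ignored
    # join each record's lines into one string
    return {k: ''.join(v) for k, v in chunks.items()}


sample = [">1\n", "AAA\n", "CCC\n", ">2\n", "GGG\n"]
result = parse_fasta_lines(sample)
print(result)
```

Because it takes any iterable of lines, it could be called with an open file object as well, for example parse_fasta_lines(open('fasta_sample.txt')).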
EDIT: Here is a solution that follows your original attempt more closely:
from pprint import pprint

def parse_fasta(path):
    with open(path) as thefile:
        label = []
        sequences = ''
        total_seq = []
        for line in thefile:
            line = line.strip()
            if len(line) == 0:
                continue  # skip blank lines
            if line.startswith('>'):
                line = line.strip('>')
                label.append(line)
                # a new header ends the previous record, if there was one
                if len(sequences) > 0:
                    total_seq.append(sequences)
                    sequences = ''
            else:
                sequences += line
        # store the last record, which no later header flushes
        total_seq.append(sequences)
        dict_version = {k: v for k, v in zip(label, total_seq)}
    return dict_version

d = parse_fasta('fasta_sample.txt')
pprint(d)
You'll see I made some changes to get the correct output. I added a list, total_seq, to hold the sequence for each sequence header. (You didn't have this, and it was a problem in your solution.) The join calls in your code were not doing anything: each was applied to a single string, so the value stayed that same string, although you had the right idea. In the revised code, the lines for one header ID are accumulated into one string of FASTA characters with sequences += line rather than with join.
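To make the point about join concrete, here is a small illustration (the variable names are made up for this example). Calling ''.join on a single string just rebuilds that same string, while calling it on several lines concatenates them.

```python
line = "AAA\n"
# join iterates over the characters of this one string and glues them
# back together, so the result is the same string it started with
single = ''.join(line.strip('\n'))
print(single)

lines = ["AAA\n", "CCC\n"]
# joining several stripped lines is what actually merges them
joined = ''.join(l.strip('\n') for l in lines)
print(joined)
```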
I tested for blank lines and did a continue if the line was blank (len(line) == 0).
The test if len(sequences) > 0 checks whether any sequence has been seen yet, which is not the case on the first record: the ID is seen before any of its sequence lines.
After the for loop completes, it is necessary to add the last sequence with total_seq.append(sequences), since every sequence except the last is added to total_seq when a new ID is detected.
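The effect of leaving out that final append can be seen in a small sketch. The lists below are hand-written stand-ins for what the loop would have collected from a two-record file; they are assumptions for illustration, not output from the sample file. Since zip stops at the shorter input, a missing last sequence silently drops the last label.

```python
labels = ['1', '2']

# Without the final append, only the first record's sequence was stored.
total_seq = ['AAACCC']
missing_last = dict(zip(labels, total_seq))
print(missing_last)

# With the final append after the loop, the last sequence is stored too.
sequences = 'GGG'   # what is still held in `sequences` when the loop ends
total_seq.append(sequences)
complete = dict(zip(labels, total_seq))
print(complete)
```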
I hope this explanation is helpful, as it follows your code more closely.