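How to grab internal text from matching lines in multiline text in Python?
I have a text file named test.txt. I want to grab the lines in test.txt that start with >lcl, then extract the value that follows locus_tag inside the square brackets. I want to do the same for the value that follows location. The result I want is shown below. How can I do this in Python?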
Desired result
SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)
test.txt
>lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] [partial=5',3'] [location=<504653..>506706] [gbkey=Gene]
ATGGGCAAAGCTTCTAGGAATAAGACGAAGCATCGCGCTGATCCTACCGCAAAAACTGTTAAGCCACCCA
CTGACCCAGAGCTTGCAGCAATTCGAGTTAACAAAATTCTGCCAATTCTCCAAGATTTACAAAGTGCAGA
CCAGTCAAAGAGATCAACTGCTGCAACTGCCATTGCGAACCTCGTTGACGATACAAAATGTCGAAAGTTA
TTCTTGAGAGAGCAAATTGTTCGTATTCTACTCGAACAAACCCTTACAGACTCAAGCATGGAAACTAGAA
>lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] [db_xref=GeneID:5483157] [partial=5',3'] [location=complement(<502136..>503461)] [gbkey=Gene]
ATGATCTGTAATACGCTCGGTGTTCCACCCTGCAACAGAATTCTTAAGAAATTCTCCGTTGGCGAGAGTC
GTCTCGAAATTCAAGACTCAGTACGAGGCAAAGATGTCTACATCATTCAATCGGGTGGAGGAAAGGCCAA
TGATCACTTCGTGGATCTTTGCATTATGATCTCCGCATGCAAAACTGGCTCTGCCAAGCGCGTCACTGTC
GTCCTTCCTTTGTTTCCTTATTCACGACAACCTGATCTGCCATACAACAAGATTGGCGCACCACTTGCCA
>lcl|NW_001820834.1_gene_1034 [locus_tag=SS1G_02099] [db_xref=GeneID:5493612] [partial=5',3'] [location=<2692251..>2693298] [gbkey=Gene]
ATGGCTTCTGTTTACAAGTCATTATCAAAGACCTCTGGTCATAAAGAAGAAACCCCGACTGGTGTCAAGA
AAAACAAGCAAAGAGTTTTGATCTTGTCTTCAAGAGGAATAACTTACAGGTATATAAATTTGTACCGATG
CGATGCAAAAAATCGCAGGAAAATGCTAACTCTACAACTTAGACATCGACATCTCCTCAATGACCTTGCG
TCCCTACTTCCCCACGGTAGGAAAGATGCGAAACTCGATACCAAGTCAAAGCTTTATCAATTGAATGAAT
>lcl|NW_001820830.1_gene_400 [locus_tag=SS1G_05227] [db_xref=GeneID:5489764] [partial=5',3'] [location=complement(<1032740..>1033620)] [gbkey=Gene]
ATGGCGGACGGATGTAAGTTAATTGATGTTCCTACTATTCCAGACTAATATTTGTTCTCGTCCCTACAAT
GCATTCGGAACGGATGGTACTCAGTTAACTTTGTAACTAATACAACGTCTAGTAAATGACCAAAGAACTG
I am new to Python, so I came up with something like this:
results = []
f = open("test.txt", 'r')
while True:
    line = f.readline()
    if not line:
        break
    file_name = line.split("locus_tag")[-1].strip()
    f.readline()  # skip line
    data_seq1 = f.readline().strip()
    f.readline()
    data_seq2 = f.readline().strip()
    results.append((file_name, data_seq1))
I think the simplest way to solve this is with a regex, as in the following example:
import re

results = []
# Open the file in 'read' mode;
# the with statement takes care of closing the file
with open('YOUR_FILE_PATH', 'r') as f_file:
    # Read the entire file as one string
    data = f_file.read()

# Search for lines that begin with '>lcl' and capture
# the values of [locus_tag=...] and [location=...]
results = re.findall(r'>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', data)
for locus, location in results:
    print(locus, location)
Output:
SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)
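As a self-contained check of the pattern above, the sketch below runs it against two header lines copied from the question (with the sequence lines shortened) instead of reading a file. The `re.MULTILINE` flag and the `[^\]]*` character classes are my own tightening and are not part of the original answer; they anchor each match to the start of a line and keep a capture from running past a closing bracket.

```python
import re

# Two records from the question; sequence lines are truncated for brevity.
sample = """\
>lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] [partial=5',3'] [location=<504653..>506706] [gbkey=Gene]
ATGGGCAAAGCTTCTAGG
>lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] [db_xref=GeneID:5483157] [partial=5',3'] [location=complement(<502136..>503461)] [gbkey=Gene]
ATGATCTGTAATACGCTC
"""

# '^' with re.MULTILINE matches at the start of every line, so only
# header lines beginning with '>lcl' can produce a match.
pattern = re.compile(
    r'^>lcl.*?\[locus_tag=([^\]]*)\].*?\[location=([^\]]*)\]',
    re.MULTILINE,
)

pairs = pattern.findall(sample)
for locus, location in pairs:
    print(locus, location)
```

Because `.` does not match a newline by default, each match stays within a single header line even though the whole text is searched at once.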
Another variation that stores the results in a dict and splits the file into lines:
import re

results = {}
with open('fichier1', 'r') as f_file:
    # Split the file's contents into a list of lines
    data = f_file.readlines()

for line in data:
    # Search for the lines that begin with '>lcl',
    # same as in the first attempt
    results.update(re.findall(r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', line))

for locus, location in results.items():
    print(locus, location)
Edit: creating a DataFrame and exporting it to a csv file:
import re
from pandas import DataFrame as df

results = {}
with open('fichier1', 'r') as f_file:
    data = f_file.readlines()

for line in data:
    results.update(re.findall(
        r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]',
        line
    ))

df_ = df(
    list(results.items()),
    index=range(1, len(results) + 1),
    columns=['locus', 'location']
)
print(df_)
df_.to_csv('results.csv', sep=',')
It will print the following and create a file named results.csv:
        locus                        location
1  SS1G_12233    complement(<502136..>503461)
2  SS1G_08319                <504653..>506706
3  SS1G_05227  complement(<1032740..>1033620)
4  SS1G_02099              <2692251..>2693298
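If pandas is not available, a similar file can probably be produced with the standard library's `csv` module. The following is only a sketch under assumptions: `write_pairs` is a hypothetical helper name, the `pairs` list stands in for the extracted results, and an in-memory buffer is used so the example runs without touching the disk.

```python
import csv
import io

def write_pairs(pairs, handle):
    """Write (locus, location) pairs as CSV rows preceded by a header row."""
    writer = csv.writer(handle)
    writer.writerow(['locus', 'location'])
    writer.writerows(pairs)

# Stand-in for the values extracted by the regex (taken from the question).
pairs = [
    ('SS1G_08319', '<504653..>506706'),
    ('SS1G_12233', 'complement(<502136..>503461)'),
]

buffer = io.StringIO()
write_pairs(pairs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='')` as the `handle`; the `newline=''` argument is what the `csv` documentation recommends for files handed to a writer.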
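I would like to offer two alternative solutions. One uses a regular expression to extract any set of named tags on a line, while the other is pure trivia but shows a way to do it without regex.
Generic regex solution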
import re

def get_tags(filename, tags, prefix='>lcl'):
    tags = set(tags)
    pattern = re.compile(r'\[(.+?)=(.+?)\]')

    def parse_line(line):
        return {m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags}

    with open(filename) as f:
        return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]
This function returns a list of dictionaries keyed by the tags you are interested in. You would use it like this:
tags = ['locus_tag', 'location']
result = get_tags('test.txt', tags)
You can use the result to get exactly the printout you want:
for line in get_tags('test.txt', tags):
    print(*(line[tag] for tag in tags))
The advantage is that you can use the result however you choose later on, and configure which tags to extract.
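To make the shape of the return value concrete, here is a sketch of the same parsing logic applied to an in-memory list of lines rather than a file. `get_tags_from_lines` is a hypothetical name introduced only for this illustration, and the extra `gbkey` tag is requested simply to show that the set of tags is configurable.

```python
import re

TAG_PATTERN = re.compile(r'\[(.+?)=(.+?)\]')

def get_tags_from_lines(lines, tags, prefix='>lcl'):
    """Return one dict per matching line, keyed by the requested tag names."""
    wanted = set(tags)
    return [
        {m.group(1): m.group(2)
         for m in TAG_PATTERN.finditer(line)
         if m.group(1) in wanted}
        for line in lines
        if prefix is None or line.startswith(prefix)
    ]

# One header line from the question plus a shortened sequence line.
lines = [
    ">lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] "
    "[partial=5',3'] [location=<504653..>506706] [gbkey=Gene]",
    "ATGGGCAAAGCTTCTAGG",
]

rows = get_tags_from_lines(lines, ['locus_tag', 'location', 'gbkey'])
print(rows)
```

The sequence line is skipped because it does not start with the prefix, so `rows` holds a single dictionary for the single header line.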
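Non-regex solution
I wrote this version only to show that it is possible. Please don't imitate it, because the code is a pointless maintenance burden.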
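def get_tags2(filename, tags, prefix='>lcl'):
    tags = set(tags)

    def parse_line(line):
        # Split on '[' to isolate each tag, drop everything after ']',
        # then split 'name=value' into a [name, value] pair
        items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
        return dict(tag for tag in items if tag[0] in tags)

    with open(filename) as f:
        # Iterate over the open file handle line by line
        return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]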
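This function behaves the same as the first one, but the parsing function is a mess by comparison. It is also far less robust, for example because it assumes that all your square brackets are more or less matched.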
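One way to see the robustness gap is a value that itself contains an `=` sign. The sketch below copies the two parsing expressions into standalone functions (`parse_with_regex` and `parse_with_split` are hypothetical names) and feeds them a made-up header line; the `note=a=b` tag is invented for the demonstration and does not appear in the question's data.

```python
import re

def parse_with_regex(line, tags):
    # Same pattern as the generic regex solution.
    pattern = re.compile(r'\[(.+?)=(.+?)\]')
    return {m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags}

def parse_with_split(line, tags):
    # Same expression as the non-regex solution.
    items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
    return dict(tag for tag in items if tag[0] in tags)

# Invented header: the 'note' value contains an extra '='.
line = '>lcl|example [locus_tag=SS1G_08319] [note=a=b]'
tags = {'locus_tag', 'note'}

regex_result = parse_with_regex(line, tags)
print(regex_result)

try:
    split_result = parse_with_split(line, tags)
except ValueError as exc:
    # 'note=a=b' splits into three pieces, which dict() cannot use as a pair.
    split_result = None
    print('split version failed:', exc)
```

The regex version keeps `a=b` as the value because it only splits at the first `=` inside each bracket pair, whereas the split version ends up with a three-element item and `dict()` rejects it.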
Here is an IDEOne link showing both approaches: https://ideone.com/X2LKqL