Search files in Python based on header and footer patterns
I want to parse a file that looks like this:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
HEADER
body
body
body
FOOTER
BLABLABLABLA
BLABLABLABLA
BLABLABLABLA
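I want to extract the content that sits between HEADER and FOOTER. The number of lines between each HEADER and FOOTER can vary, and so can the content itself, so I wrote the following code to extract it: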
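import re

start_flag = False  # becomes True once the HEADER line has been seen
body = ""
fd = open(file, "r")  # `file` holds the path of the file to parse
for line in fd:
    if not start_flag:
        match = re.search(r'.*HEADER.*', line)
        if not match:
            continue
        else:
            body = body + line + "\n"
            start_flag = True
    else:
        match_end = re.search(r'.*FOOTER.*', line)
        if not match_end:
            body = body + line + "\n"
            continue
        else:
            body = body + line + "\n\n"
            break
fd.close()
print(body)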
Is this the best way to extract content from a file using Python? What other ways are there to solve this problem?
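from itertools import groupby

with open(f, "r") as fin:  # `f` is the path of the file to parse
    # Group consecutive lines by whether they are a HEADER/FOOTER marker.
    groups = groupby(fin, key=lambda k: k.strip() in ("HEADER", "FOOTER"))
    # Skip groups up to and including the first marker group (the HEADER).
    any(k for k, g in groups)
    # The next group holds the lines between HEADER and FOOTER.
    content = list(next(groups)[1])
print(content)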
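To see why the `any(...)` call above lands on the right group, it may help to look at the runs that `groupby` produces. The following is only a small sketch on a made-up in-memory list (`lines` and `is_marker` are illustrative names, not part of the answer):

```python
from itertools import groupby

# Hypothetical stand-in for the file from the question.
lines = ["AAAA\n", "AAAA\n", "HEADER\n", "body 1\n", "body 2\n",
         "FOOTER\n", "BLABLA\n"]

def is_marker(line):
    # True for the HEADER and FOOTER lines, False for everything else.
    return line.strip() in ("HEADER", "FOOTER")

# Each run is a stretch of consecutive lines that share the same key.
runs = [(key, list(group)) for key, group in groupby(lines, key=is_marker)]

# any() stops at the first True key, i.e. right after reaching the HEADER
# run, so the next group is the body between HEADER and FOOTER.
groups = groupby(lines, key=is_marker)
any(key for key, _ in groups)
content = list(next(groups)[1])
```

One caveat of this trick: if FOOTER directly follows HEADER, both markers fall into the same run, and the group after it is whatever comes after FOOTER rather than an empty body.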
Here is an approach using itertools:
from itertools import takewhile, dropwhile

with open("myfile.txt") as f:
    # Skip everything before the HEADER line.
    starting_iterator = dropwhile(lambda x: x.strip() != 'HEADER', f)
    # Discard the HEADER line itself.
    next(starting_iterator, None)
    # Take lines until the FOOTER line is reached.
    contents = takewhile(lambda x: x.strip() != 'FOOTER', starting_iterator)
    print(list(contents))
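The question mentions "each HEADER and FOOTER", so the file may contain more than one block. In that case a small generator is one possible way to collect every block in a single pass. This is a sketch, not part of the answer above: `iter_blocks` is a hypothetical helper name, and it assumes each marker sits alone on its own line.

```python
def iter_blocks(lines, header="HEADER", footer="FOOTER"):
    """Yield the lines between each header/footer pair as a list."""
    block = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.strip()
        if block is None:
            if stripped == header:
                block = []  # a new block starts after the header line
        elif stripped == footer:
            yield block
            block = None
        else:
            block.append(line)

# Made-up sample input with two blocks.
sample = ["AAAA\n", "HEADER\n", "body 1\n", "FOOTER\n", "BLABLA\n",
          "HEADER\n", "body 2\n", "body 3\n", "FOOTER\n"]
blocks = list(iter_blocks(sample))
```

Because it accepts any iterable of lines, an open file object can be passed in place of `sample`. A block whose FOOTER is missing is silently dropped in this version.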
Since my comment got upvoted, I might as well show how I would do it (no need to build a list in memory, that is what iterators are for):
import itertools as it

def contents(source):
    # Drop lines up to HEADER, skip HEADER itself, then yield until FOOTER.
    return it.takewhile(lambda x: "FOOTER" != x.strip(),
                        it.islice(
                            it.dropwhile(lambda x: "HEADER" != x.strip(), source),
                            1, None))

with open("testfile") as f:
    for line in contents(f):
        # Do your stuff here....
        pass
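If the file is small enough to read into memory in one go, unlike the streaming approach above, a single multiline regular expression is another possible option. The sketch below works on a made-up string; `text` and `BLOCK_RE` are illustrative names, and the pattern assumes the markers sit alone on their own lines.

```python
import re

# Hypothetical in-memory copy of the file; in practice this would come
# from something like open(path).read().
text = (
    "AAAA\n"
    "HEADER\n"
    "body 1\n"
    "body 2\n"
    "FOOTER\n"
    "BLABLA\n"
)

# MULTILINE lets ^ and $ match at line boundaries, DOTALL lets . span
# newlines, and the lazy (.*?) stops at the first FOOTER line.
BLOCK_RE = re.compile(r"^HEADER[ \t]*\n(.*?)^FOOTER[ \t]*$",
                      re.DOTALL | re.MULTILINE)

# One string per HEADER/FOOTER pair, markers excluded.
blocks = BLOCK_RE.findall(text)
```

`findall` returns one string per block, so this also covers files with several HEADER/FOOTER pairs, at the cost of holding the whole file in memory.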