How can I focus on a subset of a list in Python?
I run into this problem often. Suppose I have a text file that I read in as a list using file.readlines().
Suppose the file looks like this:
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff # indeterminate number of lines
The text I want is set off by something distinctive
I want this
I want this
I want this
I want this # indeterminate number of lines
The end is also identifiable by something distinctive
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
The way I have been handling this is to do something like the following:
themasterlist=[]
for file in filelist:
    count=0
    templist=[]
    for line in file:
        if line=='The text I want is set off by something distinctive':
            count=1
        if line=='The end is also identifiable by something distinctive':
            count=0
        if count==1:
            templist.append(line)
    themasterlist.append(templist)
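I have considered reading the file as a string (file.read()), splitting it on the endpoints, and then converting the result to a list, but I really want to use this construct for many other kinds of data. For example, suppose I am iterating over the elements of lxml.fromstring(somefile) and I want to process a subset of the elements based on whether element.text contains certain phrases, and so on.
Note that I may run through 200K to 300K files at a time.
My solution works, but it feels clumsy, as if I am missing something important about Python.
There are three very good answers, and I learned something useful from each of them. I have to pick one as the accepted answer, but I really appreciate every poster's reply; they were all very helpful.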
I like something like this:
def findblock(lines, start, stop):
    it = iter(lines)
    for line in it:
        if start in line:
            # now we are in the block, so yield till we find the end
            for line in it:
                if stop in line:
                    # let's just look for one block
                    return  # leave this generator
                    # break  # would keep looking for the next block
                yield line

for line in findblock(lines, start="something distinctive",
                      stop="something distinctive"):
    print(line)
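The `break` comment in the snippet above hints at a variant that keeps looking for further blocks. A minimal sketch of that idea follows. It takes two predicate functions instead of substrings, so the same loop could be applied to items other than text lines (for instance parsed elements). The name `find_blocks`, the predicates, and the sample data are illustrative assumptions, not part of the original answer.

```python
def find_blocks(items, is_start, is_stop):
    """Yield each block, as a list, of the items found between a start and a stop marker.

    Hypothetical helper: is_start and is_stop are caller-supplied predicates.
    """
    it = iter(items)
    for item in it:
        if is_start(item):
            block = []
            # Consume from the same iterator until the stop marker shows up.
            for item in it:
                if is_stop(item):
                    break
                block.append(item)
            # Design choice: a block with no stop marker is still yielded.
            yield block


# Made-up sample data with two blocks.
lines = [
    "stuff",
    "START here",
    "I want this 1",
    "I want this 2",
    "END here",
    "stuff",
    "START again",
    "I want this 3",
    "END again",
]

blocks = list(find_blocks(lines,
                          is_start=lambda s: "START" in s,
                          is_stop=lambda s: "END" in s))
```

Because the inner and outer loops share one iterator, every item is visited exactly once, and neither marker line ends up in a block.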
What you are missing is yield and list comprehensions. Here is your code, modified:
def findblock(lines, start='The text I want is set off by something distinctive',
              stop='The end is also identifiable by something distinctive'):
    inblock = False  # set once, before the loop, so the state carries across lines
    for line in lines:
        if line == start:
            inblock = True
        if line == stop:
            inblock = False  # or return, maybe?
        if inblock:
            yield line

themasterlist = [list(findblock(file)) for file in files]
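One detail worth spelling out: lines read from a file keep their trailing newline, so an exact comparison such as `line == start` only matches if the marker includes it or the line is stripped first. Below is a rough sketch of a per-file helper that strips the newline and closes each file promptly, which may matter when looping over hundreds of thousands of files. The name `block_from_path` and the temporary sample file are assumptions for illustration. Unlike the code above, this version leaves the start marker line out of the result.

```python
import os
import tempfile

START = 'The text I want is set off by something distinctive'
STOP = 'The end is also identifiable by something distinctive'


def block_from_path(path, start=START, stop=STOP):
    """Return the lines strictly between the start and stop markers of one file."""
    block = []
    inblock = False
    with open(path) as handle:           # the file is closed on leaving this block
        for raw in handle:
            line = raw.rstrip('\n')      # drop the newline before comparing
            if inblock:
                if line == stop:
                    break                # stop reading the rest of the file
                block.append(line)
            elif line == start:
                inblock = True
    return block


# Demonstration on a throwaway file; the content is made up.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w') as out:
        out.write('stuff\n' + START + '\nI want this\nI want this\n' + STOP + '\nstuff\n')
    result = block_from_path(path)
```

A list comprehension such as `[block_from_path(p) for p in paths]` would then build the master list while keeping only one file open at a time.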
You can do the following:
data = list(filelist)
topindex = data.index('The text I want is set off by something distinctive')
endindex = data.index('The end is also identifiable by something distinctive')
themasterlist = data[topindex+1:endindex]
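The above will raise an exception (a ValueError from list.index) if your distinctive text is not found, so be prepared for that. Also note that I made sure data is a list because, despite the name, I am not sure whether filelist is actually a list (it might be an iterator).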
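As a sketch of what being prepared for that exception could look like, the helper below wraps the two `index` calls in a `try`/`except ValueError`. It also starts the search for the end marker after the start marker, using the optional second argument of `list.index`. The name `slice_between` and the choice to return `None` on failure are assumptions made for this example.

```python
def slice_between(data, start, stop):
    """Return the items between start and stop, or None if either marker is missing.

    Hypothetical helper built on list.index and slicing.
    """
    data = list(data)                    # accept any iterable, as in the answer
    try:
        top = data.index(start)
        end = data.index(stop, top + 1)  # only look for the end after the start
    except ValueError:
        return None
    return data[top + 1:end]


sample = ['stuff', 'BEGIN', 'I want this', 'I want this', 'END', 'stuff']
wanted = slice_between(sample, 'BEGIN', 'END')
absent = slice_between(sample, 'BEGIN', 'NO SUCH MARKER')
```

Returning `None` rather than an empty list lets the caller tell a missing marker apart from a block that happens to be empty.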
If every file has exactly one block of interest, you can do this:
from itertools import dropwhile, takewhile
startline = "The text I want is set off by something distinctive"
endline = "The end is also identifiable by something distinctive"
masterlist = []
for file in filelist:
    next(dropwhile(lambda line: line != startline, file))
    masterlist.append(list(takewhile(lambda line: line != endline, file)))
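To see how the two itertools calls cooperate, here is a small self-contained sketch that runs the same pattern on in-memory files built with `io.StringIO`. The short markers, the `one_block` name, and the `None` default passed to `next` (which avoids a `StopIteration` when no start line exists) are assumptions for the demonstration. Note that the markers carry a trailing newline here, because iterating over a file yields lines with their newline attached.

```python
import io
from itertools import dropwhile, takewhile

# Shortened, made-up markers; they include the newline that file iteration keeps.
startline = "START\n"
endline = "END\n"


def one_block(file):
    """Return the lines between startline and endline of an open file-like object."""
    # dropwhile skips lines until the start marker; next() consumes the marker itself.
    # The None default means "no start marker found" instead of raising StopIteration.
    if next(dropwhile(lambda line: line != startline, file), None) is None:
        return []
    # takewhile keeps reading from the same file object until it sees the end marker.
    return list(takewhile(lambda line: line != endline, file))


found = one_block(io.StringIO("stuff\nSTART\nI want this\nI want this\nEND\nstuff\n"))
missing = one_block(io.StringIO("stuff\nstuff\n"))
```

This only works because the file object is an iterator that remembers its position: `takewhile` picks up exactly where `dropwhile` left off. With a plain list, each call would start again from the beginning.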
If the number of blocks in each file is unknown, this gets less elegant:
for file in filelist:
    templist = []
    while True:
        try:
            next(dropwhile(lambda line: line != startline, file))
            # collect into the per-file list, which is appended to masterlist below
            templist += takewhile(lambda line: line != endline, file)
        except StopIteration:
            break
    masterlist.append(templist)
Note that this code assumes filelist is a list of open file objects.