How to focus on a subset of a list in Python

Question

I run into this problem often. Suppose I have a text file that I have read in as a list using file.readlines().

Suppose the file looks like this

stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff # indeterminate number of lines
The text I want is set off by something distinctive
I want this
I want this
I want this
I want this # indeterminate number of lines
The end is also identifiable by something distinctive
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff

The way I have been handling this is to do something like this

themasterlist = []
for file in filelist:
    count = 0  # 1 while we are inside the wanted block, 0 otherwise
    templist = []
    for line in file:
        if line == 'The text I want is set off by something distinctive':
            count = 1
        if line == 'The end is also identifiable by something distinctive':
            count = 0
        if count == 1:
            templist.append(line)
    themasterlist.append(templist)

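I have considered using a string (file.read()), splitting it on the endpoints, and then converting it to a list, but I really want to use this construct for many other kinds of data. For example, suppose I am iterating over the elements of lxml.fromstring(somefile) and want to process a subset of the elements based on whether element.text contains certain phrases, and so on.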
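The generalization described above (the same start/stop logic applied to any iterable, not only lines of text) could be sketched as a generator that takes two predicate functions. The names `between`, `is_start`, and `is_end` are hypothetical, the sample document is made up, and the standard library's `xml.etree.ElementTree` stands in for lxml here; treat this as a sketch under those assumptions rather than a definitive implementation.

```python
import xml.etree.ElementTree as ET


def between(items, is_start, is_end):
    """Yield the items strictly between the first item matching is_start
    and the next item matching is_end (both markers are excluded)."""
    it = iter(items)
    for item in it:
        if is_start(item):
            break
    else:
        return  # no start marker found: yield nothing
    for item in it:
        if is_end(item):
            return
        yield item


# Hypothetical document; ElementTree is used as a stand-in for lxml.
root = ET.fromstring(
    "<doc><p>stuff</p><p>BEGIN here</p><p>I want this</p>"
    "<p>and this</p><p>END here</p><p>stuff</p></doc>"
)
wanted = list(between(
    root,
    is_start=lambda el: "BEGIN" in (el.text or ""),
    is_end=lambda el: "END" in (el.text or ""),
))
texts = [el.text for el in wanted]
```

Because the predicates are ordinary callables, the same generator works for lines, XML elements, or any other sequence, and it never holds more than one item at a time.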

Note that I may be running through 200K to 300K files at a time.

My solution works, but it feels clumsy, as if I am missing something important about Python.

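There are three very good answers, and I learned something useful from each of them. I have to pick one as the accepted answer, but I really appreciate each poster's response; they were all very helpful.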

Answer 1

I like something like this:

def findblock( lines, start, stop ):
    it = iter(lines)
    for line in it:
        if start in line:
            # now we are in the block, so yield till we find the end
            for line in it:
                if stop in line:
                    # lets just look for one block
                    return # leave this generator
                    # break # would keep looking for the next block
                yield line                

for line in findblock(lines, start="something distinctive",
                             stop="something distinctive"):
    print(line)

What you are missing is yield and list comprehensions. Here is your code, modified:

def findblock(lines, start='The text I want is set off by something distinctive',
                     stop='The end is also identifiable by something distinctive'):
    inblock = False  # initialised once, outside the loop
    for line in lines:
        if line == start:
            inblock = True
        if line == stop:
            inblock = False  # or return, if only one block is expected
        if inblock:
            yield line  # note: the start marker line itself is also yielded

themasterlist = [list(findblock(file)) for file in filelist]
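If a file may contain several blocks, the generator idea extends naturally: collect lines while inside a block and yield each completed block as a list. This is a minimal sketch, assuming markers are matched by substring as in the first example and that the marker lines themselves are not wanted; `findblocks` and the sample data are hypothetical.

```python
def findblocks(lines, start, stop):
    """Yield one list per block found between a start and a stop marker."""
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None:
            if start in line:
                block = []  # entering a block; the marker line is skipped
        elif stop in line:
            yield block
            block = None
        else:
            block.append(line)
    # An unterminated final block is dropped in this sketch.


sample = [
    "stuff",
    "START marker",
    "I want this",
    "I want this too",
    "END marker",
    "stuff",
    "START marker",
    "second block",
    "END marker",
]
blocks = list(findblocks(sample, start="START", stop="END"))
```

Whether an unterminated block should be dropped or yielded anyway depends on the data; here it is dropped to keep the sketch short.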

Answer 2

You can do something like this:

data = list(filelist)
topindex = data.index('The text I want is set off by something distinctive')
endindex = data.index('The end is also identifiable by something distinctive')
themasterlist = data[topindex+1:endindex]

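If your distinctive text is not found, the above will raise an exception, so be prepared for that. Also note that I made sure data is a list because, despite the name, I am not sure whether filelist is a list (it could be an iterator).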
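A slightly more defensive version of the same idea might look like the sketch below. `list.index` raises `ValueError` when the value is absent, and passing a start position as its second argument restricts the search for the end marker to lines after the start marker. The name `extract_block`, the sample data, and the choice to return an empty list when a marker is missing are assumptions for illustration. Lines are stripped first because readlines() keeps the trailing newline.

```python
def extract_block(lines, start, stop):
    """Return the lines between start and stop, or [] if either is missing."""
    # Lines read with readlines() keep their trailing newline, so strip it
    # before comparing against the marker text.
    data = [line.rstrip("\n") for line in lines]
    try:
        top = data.index(start)
        end = data.index(stop, top + 1)  # only look after the start marker
    except ValueError:
        return []  # assumption: a missing marker means "no block"
    return data[top + 1:end]


raw = ["stuff\n", "BEGIN\n", "I want this\n", "END\n", "stuff\n"]
block = extract_block(raw, "BEGIN", "END")
```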

Answer 3

If each file has exactly one block of interest, you can do

from itertools import dropwhile, takewhile
startline = "The text I want is set off by something distinctive"
endline = "The end is also identifiable by something distinctive"
masterlist = []
for file in filelist:
    next(dropwhile(lambda line: line != startline, file))
    masterlist.append(list(takewhile(lambda line: line != endline, file)))

If the number of blocks in each file is unknown, this becomes somewhat less elegant:

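for file in filelist:
    templist = []
    while True:
        try:
            # skip everything up to and including the next start marker
            next(dropwhile(lambda line: line != startline, file))
            # collect lines up to (not including) the end marker
            templist += takewhile(lambda line: line != endline, file)
        except StopIteration:
            break
    masterlist.append(templist)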

Note that this code assumes filelist is a list of open file objects.
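The reason open file objects matter is that `dropwhile` and `takewhile` must consume the same iterator, so that `takewhile` resumes where `dropwhile` stopped; a plain list would be restarted from the beginning on each call. A small sketch of the single-block case, using `io.StringIO` as a stand-in for a real file (the sample text and the `stripped` helper are assumptions for illustration):

```python
import io
from itertools import dropwhile, takewhile

startline = "The text I want is set off by something distinctive"
endline = "The end is also identifiable by something distinctive"


def stripped(fileobj):
    """Yield lines without their trailing newline, lazily."""
    for line in fileobj:
        yield line.rstrip("\n")


fake_file = io.StringIO(
    "stuff\n"
    + startline + "\n"
    + "I want this\n"
    + "I want this\n"
    + endline + "\n"
    + "stuff\n"
)

lines = stripped(fake_file)  # a single shared iterator
# Skip everything up to and including the start marker.
next(dropwhile(lambda line: line != startline, lines))
# Collect lines until the end marker (which is consumed but not kept).
block = list(takewhile(lambda line: line != endline, lines))
```

If the start marker never appears, the `next` call raises `StopIteration`, which is what the loop in the multi-block version relies on to terminate.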

3 solutions

Solution 1
4 accepted 2011-02-20 22:57:44

Solution 2
2 2011-02-20 22:41:36

Solution 3
1 2011-02-20 22:54:22
