
How can I focus on a subset of a list in Python?

I run into this problem a lot. Suppose I have a text file that I read in as a list using file.readlines().

Suppose the file looks something like this

stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff #indeterminate number of line \
The text I want is set off by something distinctive
I want this
I want this
I want this
I want this # indeterminate number of lines
The end is also identifiable by something distinctive
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff
stuff stuff stuff stuff stuff

The way I have been handling this is to do something like this

themasterlist=[]
for file in filelist:
    count=0
    templist=[]
    for line in file:
        if line=='The text I want is set off by something distinctive':
            count=1
        if line=='The end is also identifiable by something distinctive':
            count=0
        if count==1:
            templist.append(line)
    themasterlist.append(templist)

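I have thought about reading the file as a string (file.read()), splitting it at the endpoints, and then converting the result to a list, but really I would like to use this construct for many other types. For example, suppose I am iterating over the elements of lxml.fromstring(somefile) and I want to process a subset of the elements based on whether element.text contains certain phrases, and so on.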

Note that I may be running through 200K to 300K files at a time.

My solution works, but it feels clumsy, like I am missing something important about Python.

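There were three really good answers, and I learned something useful from each of them. I need to pick one as the accepted answer, but I really appreciate every poster's response; they were all very helpful.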

I like something like this:

def findblock( lines, start, stop ):
    it = iter(lines)
    for line in it:
        if start in line:
            # now we are in the block, so yield till we find the end
            for line in it:
                if stop in line:
                    # lets just look for one block
                    return # leave this generator
                    # break # would keep looking for the next block
                yield line                

for line in findblock(lines, start="something distinctive", 
                             stop="something distinctive"):
    print(line)
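The commented-out `break` in the generator above hints at a variant that keeps scanning for further blocks. A minimal sketch of that variant follows; the name `find_all_blocks` and the `BEGIN`/`END` markers are made up for illustration, and the lines of all blocks are yielded as one flat stream.

```python
def find_all_blocks(lines, start, stop):
    """Yield the lines of every start/stop-delimited block, flattened.

    Hypothetical variant of findblock: `break` replaces `return`, so the
    outer loop resumes looking for the next start marker.
    """
    it = iter(lines)  # one shared iterator, so both loops advance together
    for line in it:
        if start in line:
            for line in it:
                if stop in line:
                    break  # leave this block, keep scanning for the next one
                yield line


sample = ["a", "BEGIN", "1", "END", "b", "BEGIN", "2", "3", "END"]
result = list(find_all_blocks(sample, "BEGIN", "END"))
print(result)
```

Because both loops pull from the same iterator, each line is read only once, however many blocks the input contains.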

What you are missing is yield and list comprehensions. Here is your code, modified:

def findblock( lines, start='The text I want is set off by something distinctive', 
                      stop='The end is also identifiable by something distinctive'):
    inblock = False  # initialize once, before the loop, so the state persists across lines
    for line in lines:
        if line==start:
            inblock=True
        if line==stop:
            inblock=False # or return mb?
        if inblock:
            yield line

themasterlist = [list(findblock( file )) for file in files]
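The question also asks about reusing this construct for other kinds of items, such as lxml elements filtered by `element.text`. One way to sketch that is to pass predicates instead of marker strings. Everything here is an assumption for illustration: `take_between` is a hypothetical helper, and `Element` is a stand-in namedtuple with a `text` attribute, not the lxml class. Unlike the code above, this sketch excludes the start marker itself.

```python
from collections import namedtuple


def take_between(items, is_start, is_stop):
    """Yield the items strictly between the first item matching is_start
    and the next item matching is_stop (hypothetical helper)."""
    inblock = False
    for item in items:
        if inblock:
            if is_stop(item):
                return  # only the first block is wanted
            yield item
        elif is_start(item):
            inblock = True


# Stand-in for parsed elements; only a `.text` attribute is assumed.
Element = namedtuple("Element", "text")
elements = [Element(t) for t in
            ["intro", "BEGIN section", "a", "b", "END section", "outro"]]

wanted = list(take_between(elements,
                           lambda e: "BEGIN" in e.text,
                           lambda e: "END" in e.text))
print([e.text for e in wanted])
```

Since the generator only iterates and calls the two predicates, it works on any iterable: lists of lines, open file objects, or element trees.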

You could do something like this:

data = list(filelist)
topindex = data.index('The text I want is set off by something distinctive')
endindex = data.index('The end is also identifiable by something distinctive')
themasterlist = data[topindex+1:endindex]

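The above will raise an exception (a ValueError from list.index) if your distinctive text isn't found, so be prepared for that. Also note that I make sure data is a list because, despite its name, I am not sure whether filelist is a list (it might be an iterator).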
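A minimal sketch of preparing for that exception is shown below. The helper name `slice_between` and the choice to return an empty list when a marker is missing are assumptions, not part of the answer. It also strips trailing newlines, since lines from file.readlines() keep their `"\n"` and would otherwise never compare equal to the marker strings.

```python
def slice_between(lines, start, end):
    """Return the lines strictly between the start and end markers,
    or an empty list if either marker is missing (hypothetical helper)."""
    data = [line.rstrip("\n") for line in lines]  # also forces a real list
    try:
        top = data.index(start)
        # Search for the end marker only after the start marker.
        bottom = data.index(end, top + 1)
    except ValueError:
        return []
    return data[top + 1:bottom]


lines = ["stuff\n", "START\n", "I want this\n", "END\n", "stuff\n"]
print(slice_between(lines, "START", "END"))
```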

If there is exactly one block of interest in each file, you can

from itertools import dropwhile, takewhile
startline = "The text I want is set off by something distinctive"
endline = "The end is also identifiable by something distinctive"
masterlist = []
for file in filelist:
    next(dropwhile(lambda line: line != startline, file))
    masterlist.append(list(takewhile(lambda line: line != endline, file)))

If the number of blocks in each file is unknown, this gets less elegant:

for file in filelist:
    templist = []
    while True:
        try:
            next(dropwhile(lambda line: line != startline, file))
            templist += takewhile(lambda line: line != endline, file)
        except StopIteration:
            break
    masterlist.append(templist)

Note that this code assumes filelist is a list of open file objects.
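For the case with an unknown number of blocks, a generator can keep each block as its own list and avoid the try/except. This is only a sketch under stated assumptions: `blocks` is a hypothetical helper, the short `"S"`/`"E"` markers are placeholders, and trailing newlines are stripped so that the equality comparisons work on lines read from a real file.

```python
from itertools import takewhile


def blocks(lines, start, end):
    """Yield each start/end-delimited block as a separate list
    (hypothetical helper; the markers themselves are excluded)."""
    it = (line.rstrip("\n") for line in lines)  # one shared iterator
    while True:
        # Skip up to and including the next start marker.
        for line in it:
            if line == start:
                break
        else:
            return  # input exhausted, no further blocks
        # takewhile also consumes the end marker, so scanning resumes after it.
        yield list(takewhile(lambda line: line != end, it))


sample = ["x\n", "S\n", "1\n", "2\n", "E\n", "y\n", "S\n", "3\n", "E\n"]
print(list(blocks(sample, "S", "E")))
```

If an end marker is missing, the last block simply runs to the end of the input.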

