
Accessing x+1 element with 'for x in list' in Python

I'm trying to parse a newline-delimited text file into blocks of lines, which are appended to a .txt file. I'd like to be able to grab a fixed number of lines AFTER my ending string. The content of these lines varies, so choosing a later 'end string' to match them would miss lines.

Example of file:

"Start"
"..."
"..."
"..."
"..."
"---" ##End here
"xxx" ##Unique data here
"xxx" ##And here

And here's the code:

first = "Start"
first_end = "---"

with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    for line in infile:
        if line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            ##Want to also write next 2 lines here
        elif copy:
            outfile.write(line)

Is there any way to do this using 'for line in infile', or do I need to use a different type of loop?

You can use next (or, in Python 3 and up, readline) to retrieve the next line in the file:

    elif line.strip().startswith(first_end):
        copy = False
        outfile.write(line)
        outfile.write(next(infile))
        outfile.write(next(infile))

or

    #note: not compatible with Python 2.7 and below
    elif line.strip().startswith(first_end):
        copy = False
        outfile.write(line)
        outfile.write(infile.readline())
        outfile.write(infile.readline())

Either version also advances the file pointer by two additional lines, so the next iteration of 'for line in infile:' will skip past the two lines you just read.
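
As a sketch of the behavior described above, the snippet below uses io.StringIO as a stand-in for the log file (an assumption, made so the example is self-contained) and shows that lines consumed with next are not seen again by the for loop. It also passes a default value to next, so that a file ending right after the end marker does not raise StopIteration. The helper name copy_blocks and the sample data are hypothetical.

```python
import io


def copy_blocks(infile, first="Start", first_end="---", extra=2):
    """Collect each block from the start marker through `extra` lines past the end marker."""
    out = []
    copy = False
    for line in infile:
        stripped = line.strip()
        if stripped.startswith(first):
            copy = True
            out.append(line)
        elif stripped.startswith(first_end):
            copy = False
            out.append(line)
            for _ in range(extra):
                # next() advances the same iterator the for loop uses;
                # the "" default avoids StopIteration at end of file.
                out.append(next(infile, ""))
        elif copy:
            out.append(line)
    return out


# io.StringIO stands in for the opened log file.
sample = io.StringIO("Start\na\n---\nx1\nx2\nskipped\nStart\nb\n---\nx3\n")
result = copy_blocks(sample)
print("".join(result), end="")
```

The line "skipped" is never written, and the second block, which has only one line after its end marker, is handled without an exception.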


Bonus terminology nitpick: a file object is not a list, and methods for accessing the x+1th element of a list might not work for accessing the next line of a file, and vice versa. If you did want to access the next item of a proper list object, you could use enumerate so you can perform arithmetic on the list's index. For example:

seq = ["foo", "bar", "baz", "qux", "troz", "zort"]

#find all instances of "baz" and also the first two elements after "baz"
for idx, item in enumerate(seq):
    if item == "baz":
        print(item)
        print(seq[idx+1])
        print(seq[idx+2])

Note that, unlike readline, indexing does not advance the iterator, so 'for idx, item in enumerate(seq):' will still iterate over "qux" and "troz".
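
One caveat with index arithmetic: seq[idx+1] raises IndexError when the match sits too close to the end of the list. A small variation, offered as a sketch rather than as part of the original answer, uses a slice, which simply returns fewer items in that case. The sample list here is made up so that the last "baz" has nothing after it.

```python
seq = ["foo", "bar", "baz", "qux", "troz", "baz"]

found = []
for idx, item in enumerate(seq):
    if item == "baz":
        # A slice never raises IndexError; it is just shorter near the end.
        found.append(seq[idx:idx + 3])

print(found)
```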


An approach that works on any iterable is to use an additional variable to keep track of state across iterations. The advantage is that you don't have to know anything about how to manually advance iterables; the disadvantage is that the logic within the loop is harder to reason about, because each iteration now depends on extra state carried over from earlier ones.

first = "Start"
first_end = "---"

with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    num_items_to_write = 0
    for line in infile:
        if num_items_to_write > 0:
            outfile.write(line)
            num_items_to_write -= 1
        elif line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            num_items_to_write = 2
        elif copy:
            outfile.write(line)
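
To illustrate the "any iterable" claim, here is a sketch of the same counter logic wrapped in a hypothetical helper, select_lines, and run on a plain list and on a generator expression instead of a file. The sample data is invented for the example.

```python
def select_lines(lines, first="Start", first_end="---", extra=2):
    """Counter-based selection; `lines` may be any iterable of strings."""
    selected = []
    copy = False
    remaining = 0  # lines still to copy after an end marker
    for line in lines:
        stripped = line.strip()
        if remaining > 0:
            selected.append(line)
            remaining -= 1
        elif stripped.startswith(first):
            copy = True
            selected.append(line)
        elif stripped.startswith(first_end):
            copy = False
            selected.append(line)
            remaining = extra
        elif copy:
            selected.append(line)
    return selected


data = ["Start", "a", "---", "x1", "x2", "skipped", "Start", "b", "---", "x3"]
from_list = select_lines(data)
from_generator = select_lines(s for s in data)
print(from_list)
```

Because nothing in the loop calls next or readline, the function does not care whether it is given a file, a list, or a generator.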

In the specific case of pulling repetitive groups of data out of a delimited file, it might be appropriate to skip iteration entirely and use a regex instead. For data like yours, that might look like:

import re

with open("testlog.log") as file:
    data = file.read()

pattern = re.compile(r"""
^Start$                 #"Start" by itself on a line
(?:\n.*$)*?             #zero or more lines, matched non-greedily
                        #use (?:) for all groups so `findall` doesn't capture them later
\n---$                  #"---" by itself on a line
(?:\n.*$){2}            #exactly two lines
""", re.MULTILINE | re.VERBOSE)

#equivalent one-line regex:
#pattern = re.compile("^Start$(?:\n.*$)*?\n---$(?:\n.*$){2}", re.MULTILINE)

for group in pattern.findall(data):
    print("Found group:")
    print(group)
    print("End of group.\n\n")

When run on a log that looks like:

Start
foo
bar
baz
qux
---
troz
zort
alice
bob
carol
dave
Start
Fred
Barney
---
Wilma
Betty
Pebbles

...this will produce the output:

Found group:
Start
foo
bar
baz
qux
---
troz
zort
End of group.


Found group:
Start
Fred
Barney
---
Wilma
Betty
End of group.
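
To get back to the original goal of appending the blocks to a file, the matches can be written out one after another. The sketch below assumes the one-line form of the pattern above (written as a raw string) and a shortened sample log. It writes into a temporary directory so that it is self-contained; the file name parsed.txt is taken from the question.

```python
import os
import re
import tempfile

pattern = re.compile(r"^Start$(?:\n.*$)*?\n---$(?:\n.*$){2}", re.MULTILINE)

data = "Start\nfoo\n---\ntroz\nzort\nalice\nStart\nFred\n---\nWilma\nBetty\nPebbles\n"

blocks = pattern.findall(data)

# A temporary directory keeps the example from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "parsed.txt")
    with open(path, "a") as outfile:
        for block in blocks:
            # Matches do not include the final newline, so add it back.
            outfile.write(block + "\n")
    with open(path) as check:
        written = check.read()

print(written, end="")
```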

The easiest approach would be to write a generator function that parses the infile:

def read_file(file_handle, start_line, end_line, extra_lines=2):
    start = False
    while True:
        try:
            line = next(file_handle)
        except StopIteration:
            return

        if not start and line.strip().startswith(start_line):
            start = True
            yield line
        elif not start:
            continue
        elif line.strip().startswith(end_line):
            yield line
            # Block finished: ignore lines until the next start marker
            start = False
            try:
                for _ in range(extra_lines):
                    yield next(file_handle)
            except StopIteration:
                return
        else:
            yield line

The inner try-except clause would not be needed if you know each file is well-formed; the outer one is what ends the generator cleanly at end of file.

You can use this generator like this:

if __name__ == "__main__":
    first = "Start"
    first_end = "---"

    with open("testlog.log") as infile, open("parsed.txt", "a") as outfile:
        output = read_file(
            file_handle=infile,
            start_line=first,
            end_line=first_end,
            extra_lines=1,
        )
        outfile.writelines(output)
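
A note on the try-except clauses in read_file: since Python 3.7 (PEP 479), a StopIteration that escapes from inside a generator body is turned into a RuntimeError, so a bare next() call that can hit the end of the input should be guarded. The following sketch, with hypothetical generator names and assuming Python 3.7 or later, shows the difference:

```python
def unguarded(iterator):
    while True:
        # At end of input, next() raises StopIteration inside the generator.
        yield next(iterator)


def guarded(iterator):
    while True:
        try:
            yield next(iterator)
        except StopIteration:
            return  # end the generator cleanly


print(list(guarded(iter(["a", "b"]))))

try:
    list(unguarded(iter(["a", "b"])))
    outcome = "no error"
except RuntimeError:
    outcome = "RuntimeError"
print(outcome)
```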

A variation of @Kevin's answer with a 3-state variable and less code duplication.

first = "Start"
first_end = "---"
# Lines to read after end flag
extra_count = 2

with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    # Do not copy by default
    copy = 0

    for line in infile:
        # Strip once only
        clean_line = line.strip()

        # Enter "infinite copy" state
        if clean_line.startswith(first):
            copy = -1

        # Copy next line and extra amount
        elif clean_line.startswith(first_end):
            copy = extra_count + 1

        # If in a "must-copy" state
        if copy != 0:
            # One less line to copy if end flag passed
            if copy > 0:
                copy -= 1
            # Copy current line
            outfile.write(line)
