How to parse varying amounts of lines from a text file in Python?

I'm trying to write a simple parser that grabs multiline blocks of text from one .txt file and copies them to a new .txt file. My problem differs (I think) from similar questions posted online in that the number of lines varies from block to block, so I need some way of identifying where each desired block of text begins and ends.

Consider this minimal example of an input file:

NAME_1{a bunch of text|more text}
 1  -22.17
1 lol //
2 wtf //
NA_ME2{text|text}
 1  -25.50
1 gtfo //
NAME3{text|text}
 1  -17.50
1 brb //
2 lol //
3 wtf //

I want my parser to output a text file with NAME_1 with all its related information and NAME3 with all its related information. I want my output text file to read:

NAME_1{a bunch of text|more text}
 1  -22.17
1 lol //
2 wtf //
NAME3{text|text}
 1  -17.50
1 brb //
2 lol //
3 wtf //

I have a parser that works but is problematic (and inefficient, but I'm new to this). Specifically, the majority of the blocks of text I require are 43 lines in length, so my parser identifies a required name and then grabs that line and the next 42 lines of text. But this is a problem because some blocks of text are not 43 lines in length. This is what I have so far:

import re

infile = open('input.txt')
outfile = open('output.txt', 'w')

# Appends all needed names into a list
nameList = []
with open('list.txt') as f:
    for name in f:
        n = name.strip()
        nameList.append(n)

# Finds each required name in the input file and outputs that line and the next 42 lines
lines = infile.readlines()
for line in range(0, len(lines)):
    for l in nameList:
        if l in lines[line]:
            # 43 lines in total: the matching line plus the 42 that follow
            outfile.writelines(lines[line:line+43])

infile.close()
outfile.close()

The list.txt file contains the following:

NAME_1{
NAME3{

I think a regular expression could solve my problem. '([A-Z]\\w+){' will locate the beginning of each block of text, so I imagine there must be some way to determine whether the match corresponds to an item of nameList, and then to copy every line up to, but not including, the next match of '([A-Z]\\w+){'. This way it shouldn't matter how long a block of text is. Is it possible to identify where a desired block of text begins and ends using regular expressions in this way?

Thanks.

EDIT: Each block of text begins with a match of the regular expression '([A-Z]\\w+){'. Hence, the example input file contains three blocks of text, where the lines with NAME_1, NA_ME2 and NAME3 are the first line of each block.
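To make the block-start rule concrete, here is a minimal sketch of how such a pattern picks out the header lines. It assumes the pattern is anchored at the start of the line and that the brace is escaped; both are additions for illustration, and the names HEADER, sample and headers are hypothetical.

```python
import re

# Assumed form of the header pattern: an uppercase letter, then word
# characters, then a literal "{" (anchored at the start of the line).
HEADER = re.compile(r"^([A-Z]\w+)\{")

sample = [
    "NAME_1{a bunch of text|more text}",
    " 1  -22.17",
    "1 lol //",
    "NA_ME2{text|text}",
    " 1  -25.50",
    "NAME3{text|text}",
]

# Collect (line index, block name) for every line that starts a block.
headers = []
for i, line in enumerate(sample):
    m = HEADER.match(line)
    if m:
        headers.append((i, m.group(1)))

print(headers)  # [(0, 'NAME_1'), (3, 'NA_ME2'), (5, 'NAME3')]
```

Everything between one header index and the next belongs to the same block, whatever its length.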

Try this:

import re

s = """NAME_1{a bunch of text|more text}
 1  -22.17
1 lol //
2 wtf //
NA_ME2{text|text}
 1  -25.50
1 gtfo //
NAME3{text|text}
 1  -17.50
1 brb //
2 lol //
3 wtf //
"""

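guards = ["NAME_1", "NAME3"]  # names of the blocks to keep
r = re.compile(r"^([A-Z][A-Z0-9_]+)\{")  # a block header: NAME followed by "{"
printing = False

for line in s.splitlines():
    m = r.match(line)
    if m:
        # A new block starts here; print it only if its name is wanted.
        printing = m.group(1) in guards
    if printing:
        # strip() also drops the leading space of lines like " 1  -22.17"
        print(line.strip())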

Output:

NAME_1{a bunch of text|more text}
1  -22.17
1 lol //
2 wtf //
NAME3{text|text}
1  -17.50
1 brb //
2 lol //
3 wtf //
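
The same toggle idea can be applied to the files from the question instead of an in-memory string. The following is only a sketch under a few assumptions: the file names are the ones used in the question, list.txt holds entries such as NAME_1{ with a trailing brace, and the function names copy_blocks and filter_file are hypothetical. Lines are written unchanged, so leading spaces are preserved.

```python
import re

# Assumed header pattern: NAME followed by a literal "{" at line start.
HEADER = re.compile(r"^([A-Z]\w+)\{")


def copy_blocks(lines, wanted):
    """Yield every line that belongs to a block whose name is in `wanted`."""
    keep = False
    for line in lines:
        m = HEADER.match(line)
        if m:
            # A header line decides whether the following block is kept.
            keep = m.group(1) in wanted
        if keep:
            yield line


def filter_file(in_path, names_path, out_path):
    """Copy the wanted blocks from in_path to out_path (hypothetical helper)."""
    with open(names_path) as f:
        # Entries look like "NAME_1{", so drop the trailing brace.
        wanted = {name.strip().rstrip("{") for name in f if name.strip()}
    with open(in_path) as src, open(out_path, "w") as dst:
        # The input is streamed line by line, so block length does not matter.
        dst.writelines(copy_blocks(src, wanted))
```

Calling filter_file('input.txt', 'list.txt', 'output.txt') would then write only the NAME_1 and NAME3 blocks, assuming those three files are laid out as in the question.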
