I'm trying to write a simple parser that grabs multiline blocks of text from one .txt file and copies them to a new .txt file. I think my problem differs from similar questions posted online because the number of lines varies from block to block, so I need some way of identifying where the desired block of text begins and ends.
Consider this minimal example of an input file:
NAME_1{a bunch of text|more text}
1 -22.17
1 lol //
2 wtf //
NA_ME2{text|text}
1 -25.50
1 gtfo //
NAME3{text|text}
1 -17.50
1 brb //
2 lol //
3 wtf //
I want my parser to output a text file with NAME_1 with all its related information and NAME3 with all its related information. I want my output text file to read:
NAME_1{a bunch of text|more text}
1 -22.17
1 lol //
2 wtf //
NAME3{text|text}
1 -17.50
1 brb //
2 lol //
3 wtf //
I have a parser that works but is problematic (and inefficient, but I'm new to this). Specifically, the majority of the blocks of text I require are 43 lines in length, so my parser identifies a required name and then grabs that line and the next 42 lines of text. But this is a problem because some blocks of text are not 43 lines in length. This is what I have so far:
# Collect all needed names into a list, skipping blank lines
with open('list.txt') as f:
    nameList = [name.strip() for name in f if name.strip()]

# Find each required name in the input file and write that line and the next 42
with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    lines = infile.readlines()
    for i, line in enumerate(lines):
        for name in nameList:
            if name in line:
                # i + 43 gives the matching line plus the 42 lines after it
                outfile.writelines(lines[i:i + 43])
The list.txt file contains the following:
NAME_1{
NAME3{
I think regular expressions could solve my problem. '([A-Z]\\w+){' will locate the beginning of each block of text, so I imagine there must be some way to determine whether the RE match is equivalent to an item of nameList, and then to parse every line until (but not including) the next match of '([A-Z]\\w+){'. This way it shouldn't matter how long a block of text is. Is it possible to identify where a desired block of text begins and ends using regular expressions in this way?
Thanks.
EDIT: Each block of text begins with an occurrence of the regular expression '([A-Z]\\w+){'. Hence, the example input file contains three blocks of text where the lines with NAME_1, NA_ME2 and NAME3 represent the first line of each block.
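To make the block boundaries concrete, here is a minimal sketch of how the header pattern could be used to list where each block starts. The function name block_starts is hypothetical, and the pattern is anchored with ^ and the brace is escaped, which is an assumption about the intended match rather than something stated above.

```python
import re

# Assumed header shape: a capital letter, then word characters, then "{"
HEADER = re.compile(r"^([A-Z]\w+)\{")

def block_starts(lines):
    """Return (line index, block name) for every line that opens a block."""
    starts = []
    for i, line in enumerate(lines):
        m = HEADER.match(line)
        if m:
            starts.append((i, m.group(1)))
    return starts
```

Each block then runs from its own start index up to, but not including, the next start index, so the length of a block never needs to be known in advance.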
Try this:
import re
s = """NAME_1{a bunch of text|more text}
1 -22.17
1 lol //
2 wtf //
NA_ME2{text|text}
1 -25.50
1 gtfo //
NAME3{text|text}
1 -17.50
1 brb //
2 lol //
3 wtf //
"""
guards = ["NAME_1", "NAME3"]
# A header line starts with a capital letter, then capitals, digits or underscores, then "{"
r = re.compile(r"^([A-Z][A-Z0-9_]+)\{")
printing = False
for line in s.splitlines():
    m = r.match(line)
    if m:
        # Header line: keep printing only if this block's name is wanted
        printing = m.group(1) in guards
    if printing:
        print(line.strip())
Output:
NAME_1{a bunch of text|more text}
1 -22.17
1 lol //
2 wtf //
NAME3{text|text}
1 -17.50
1 brb //
2 lol //
3 wtf //
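The same idea can be adapted to read from and write to files instead of a string. The following is only a sketch under a few assumptions: the function names filter_blocks and copy_blocks are made up for illustration, the file names are the ones from the question, and the entries in list.txt are assumed to end in "{" as shown there.

```python
import re

# Assumed header shape: capital letter, then capitals/digits/underscores, then "{"
HEADER = re.compile(r"^([A-Z][A-Z0-9_]+)\{")

def filter_blocks(lines, wanted):
    """Yield every line belonging to a block whose header name is in `wanted`."""
    keep = False
    for line in lines:
        m = HEADER.match(line)
        if m:
            # A new block starts here; decide whether to keep it
            keep = m.group(1) in wanted
        if keep:
            yield line

def copy_blocks(input_path, names_path, output_path):
    """Copy the wanted blocks from input_path to output_path."""
    with open(names_path) as f:
        # Entries look like "NAME_1{"; drop the trailing brace and skip blank lines
        wanted = {n.strip().rstrip("{") for n in f if n.strip()}
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.writelines(filter_blocks(src, wanted))
```

Calling copy_blocks('input.txt', 'list.txt', 'output.txt') should then write only the NAME_1 and NAME3 blocks. The input file is read line by line, so it does not have to fit in memory.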