How to find matching strings up to a specific string with regex in Python

Question

I need to find specific strings in a file, but only up to the line AUTO HEADER. I am not sure how to restrict the regex so that it finds matches only up to a specific line. Can someone help me figure that out?

This is my script:

import re
a = open("mod.txt", "r").read()
op = re.findall(r"type=(\w+)", a, re.MULTILINE)
print(op)

This is my input file mod.txt:

bla bla bla
header
module a
  (
 type=bye
 type=junk
 name=xyz type=getme
 type=new
  AUTO HEADER

type=dont_take_it
type=junk
type=new

Output:

['bye', 'junk', 'getme', 'new', 'dont_take_it', 'junk', 'new']

Expected output:

['bye', 'junk', 'getme', 'new']

The regex needs to take AUTO HEADER into account, but I am not sure exactly how.

Answer 1

You can iterate over the lines of the text file and stop as soon as you reach the required key.

Ex:

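import re

filename = "mod.txt"  # input file from the question
res = []
with open(filename) as infile:
    for line in infile:
        # Stop reading once the marker line is reached.
        if "AUTO HEADER" in line:
            break
        # re.search returns the first type=... match on the line, or None.
        op = re.search(r"type=(\w+)", line)
        if op:
            res.append(op.group(1))

print(res)  # --> ['bye', 'junk', 'getme', 'new']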

Answer 2

You can use a positive lookahead in the regex together with re.DOTALL:

op = re.findall(r"type=(\w+)(?=.*AUTO HEADER)", a, re.DOTALL)
print(op)

['bye', 'junk', 'getme', 'new']

(?=.*AUTO HEADER) is a positive lookahead. It ensures that every match is followed by the text AUTO HEADER somewhere later in the input, which effectively excludes the unwanted matches after AUTO HEADER.

re.DOTALL makes . match newlines as well, so the lookahead can search across lines for AUTO HEADER.
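
One consequence of "somewhere later" is worth keeping in mind: if the marker occurs more than once, matches are kept up to the last occurrence, not the first. A minimal sketch of that behaviour, using a made-up input string (the `text` and `matches` names are illustrative, not from the question):

```python
import re

# Hypothetical input in which the marker occurs twice.
text = "type=a\nAUTO HEADER\ntype=b\nAUTO HEADER\ntype=c\n"

# Same pattern as above: a match is kept if AUTO HEADER occurs anywhere after it.
matches = re.findall(r"type=(\w+)(?=.*AUTO HEADER)", text, re.DOTALL)
print(matches)  # ['a', 'b']
```

With the input from the question the marker appears only once, so this does not change the result there.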

Answer 3

I don't think regex is the best option here, but here's how it could be done anyhow.

You could do something like this:

[\s\S]*(?=AUTO HEADER)

Here \s matches any whitespace character (space, tab, line break, ...) and \S, its opposite, matches any character that is not whitespace, so [\s\S] matches any character at all, including line breaks. The * repeats that character set zero or more times, as many times as possible.

(?=AUTO HEADER) is a positive lookahead: it requires AUTO HEADER to follow the main expression, but does not include it in the result. The whole pattern therefore matches everything before AUTO HEADER.
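
To connect this pattern to the question, one possible sketch applies it in two steps: capture everything before the marker, then run the original type=(\w+) search on that part only. The sample text is inlined here instead of being read from mod.txt, and the names `text`, `head` and `prefix` are illustrative assumptions. A lazy `*?` is used in place of `*` so that the match would stop at the first AUTO HEADER if the marker occurred more than once.

```python
import re

# Inlined copy of the sample input (an assumption; the question reads mod.txt).
text = """bla bla bla
header
module a
  (
 type=bye
 type=junk
 name=xyz type=getme
 type=new
  AUTO HEADER

type=dont_take_it
type=junk
type=new
"""

# Step 1: match everything before the first "AUTO HEADER".
head = re.match(r"[\s\S]*?(?=AUTO HEADER)", text)
# If the marker is missing, re.match returns None; fall back to the whole text.
prefix = head.group(0) if head else text

# Step 2: run the original pattern on the prefix only.
op = re.findall(r"type=(\w+)", prefix)
print(op)  # ['bye', 'junk', 'getme', 'new']
```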

Answer 4

This may sound stupid, but have you considered not passing the full text to your regex, only the text up to your keyword? There is no reason not to just separate it off quickly beforehand, no?
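
A minimal sketch of that idea, assuming the file content is already in a string (inlined here rather than read from mod.txt; the names `text`, `prefix` and `marker` are illustrative). It uses `str.partition`, which splits at the first occurrence of the separator:

```python
import re

# Inlined, shortened copy of the sample input (an assumption for this sketch).
text = """module a
 type=bye
 type=junk
 name=xyz type=getme
 type=new
  AUTO HEADER

type=dont_take_it
type=junk
"""

# partition() returns (before, separator, after); if the separator is not
# found, "before" is the whole string and the other two parts are empty.
prefix, marker, _rest = text.partition("AUTO HEADER")

op = re.findall(r"type=(\w+)", prefix)
print(op)  # ['bye', 'junk', 'getme', 'new']
```

If AUTO HEADER is absent, this searches the whole text; checking whether `marker` is empty is one way to detect that case.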

4 answers

solution1 3 2021-03-25 07:30:33
solution2 2 ACCEPTED 2021-03-25 07:45:28
solution3 1 2021-03-25 07:30:46
solution4 0 2021-03-25 07:21:29
