
How to find matching strings up to a specific string with regex in Python

I need to find specific strings in a file up to the line AUTO HEADER. I am not sure how to restrict the regex so that it only finds matches up to a specific line. Can someone help me figure that out?

This is my script:

import re
a = open("mod.txt", "r").read()
op = re.findall(r"type=(\w+)", a, re.MULTILINE)
print(op)

This is my input file mod.txt:

bla bla bla
header
module a
  (
 type=bye
 type=junk
 name=xyz type=getme
 type=new
  AUTO HEADER

type=dont_take_it
type=junk
type=new

Output:

['bye', 'junk', 'getme', 'new', 'dont_take_it', 'junk', 'new']

Expected output:

['bye', 'junk', 'getme', 'new']

I need to account for AUTO HEADER in the regex, but I am not sure exactly how.

You can iterate over each line in the text file and stop as soon as you find the required key.

Ex:

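import re

filename = "mod.txt"  # the input file from the question
res = []
with open(filename) as infile:
    for line in infile:
        # Stop reading as soon as the marker line is reached
        if "AUTO HEADER" in line:
            break
        # findall picks up every type=... on the line, not just the first
        res.extend(re.findall(r"type=(\w+)", line))

print(res)  # --> ['bye', 'junk', 'getme', 'new']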

You can use a positive lookahead in the regex together with re.DOTALL:

op = re.findall(r"type=(\w+)(?=.*AUTO HEADER)", a, re.DOTALL)
print(op)

['bye', 'junk', 'getme', 'new']

(?=.*AUTO HEADER) is a positive lookahead: a match is accepted only if the text AUTO HEADER appears somewhere after it. This effectively excludes the unwanted matches that come after AUTO HEADER.

re.DOTALL lets . match newlines as well, so the lookahead can search across lines for AUTO HEADER.
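One caveat worth checking with a small sketch: because the lookahead only requires AUTO HEADER somewhere later in the text, a second AUTO HEADER further down would let matches between the two markers through. The sample string below is made up for illustration and is not the mod.txt from the question.

```python
import re

# Hypothetical input with the marker appearing twice
text = "type=a\nAUTO HEADER\ntype=b\nAUTO HEADER\ntype=c\n"

# Every type=... that still has an AUTO HEADER somewhere after it is kept,
# so the cut-off is effectively the LAST marker, not the first.
matches = re.findall(r"type=(\w+)(?=.*AUTO HEADER)", text, re.DOTALL)
print(matches)  # ['a', 'b']
```

With a single marker, as in the question, this makes no difference.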

I don't think regex is the best option here, but here's how it could be done anyhow.

You could do something like this:

[\s\S]*(?=AUTO HEADER)

Here \s matches any whitespace character (space, tab, line break, ...) and \S, its opposite, matches anything that is not a whitespace character, so [\s\S] matches any character at all, including newlines. The * repeats that character set zero or more times, as many times as possible.

(?=AUTO HEADER) is a positive lookahead: it requires AUTO HEADER to follow immediately after the main expression, without including it in the result. Together, the pattern matches everything from the start of the text up to (but not including) the last AUTO HEADER.
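This pattern only isolates the text before AUTO HEADER; the type= values still have to be extracted from it. A minimal sketch of the two steps, assuming the file contents from the question are inlined as SAMPLE and using a hypothetical helper name types_before_header:

```python
import re

SAMPLE = """bla bla bla
header
module a
  (
 type=bye
 type=junk
 name=xyz type=getme
 type=new
  AUTO HEADER

type=dont_take_it
type=junk
type=new
"""

def types_before_header(text):
    # Greedy [\s\S]* runs to the end of the text, then backtracks
    # until the lookahead finds "AUTO HEADER" right after it.
    prefix = re.match(r"[\s\S]*(?=AUTO HEADER)", text)
    if prefix is None:
        return []  # assumption: no marker means nothing to report
    return re.findall(r"type=(\w+)", prefix.group(0))

print(types_before_header(SAMPLE))  # ['bye', 'junk', 'getme', 'new']
```

Returning an empty list when the marker is missing is a choice made for this sketch; the question does not say what should happen in that case.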

This may sound stupid, but have you considered not passing the full text to your regex at all, and only passing the text up to your keyword? There is no reason not to just separate it beforehand, is there?
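A minimal sketch of that idea, using str.partition to cut the text at the first AUTO HEADER before running the original regex. The helper name types_before_marker and the short sample string are made up for illustration.

```python
import re

def types_before_marker(text, marker="AUTO HEADER"):
    # str.partition splits at the first occurrence of marker;
    # if the marker is absent, head is the whole text.
    head, _, _ = text.partition(marker)
    return re.findall(r"type=(\w+)", head)

text = " type=bye\n name=xyz type=getme\n  AUTO HEADER\n\ntype=dont_take_it\n"
print(types_before_marker(text))  # ['bye', 'getme']
```

Note that if the marker never appears, this version searches the whole text, which may or may not be what you want.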


 