
How to find matching strings up to a specific string with regex in Python

I need to find specific strings in a file up to the line AUTO HEADER. I am not sure how to restrict the regex so that it only finds matches up to a specific line. Can someone help me figure that out?

This is my script:

import re
a = open("mod.txt", "r").read()
op = re.findall(r"type=(\w+)", a, re.MULTILINE)
print(op)

This is my input file mod.txt:

bla bla bla
header
module a
  (
 type=bye
 type=junk
 name=xyz type=getme
 type=new
  AUTO HEADER

type=dont_take_it
type=junk
type=new

Output:

['bye', 'junk', 'getme', 'new', 'dont_take_it', 'junk', 'new']

Expected output:

['bye', 'junk', 'getme', 'new']

I know the regex needs to take AUTO HEADER into account, but I am not sure exactly how.

You can iterate over each line in the text file and stop when you find the required key.

Example:

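import re

res = []
with open("mod.txt") as infile:
    for line in infile:
        # Stop reading as soon as the marker line is reached
        if "AUTO HEADER" in line:
            break
        # Capture the first type=... value on this line, if any
        op = re.search(r"type=(\w+)", line)
        if op:
            res.append(op.group(1))

print(res)  # --> ['bye', 'junk', 'getme', 'new']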
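
One thing to be aware of: re.search returns only the first type= on each line. If a line could hold several, a variation along these lines might help. This is only a sketch; the helper name collect_types and the inlined sample (with two values on the first line) are made up for illustration and are not from the question.

```python
import io
import re

def collect_types(lines, marker="AUTO HEADER"):
    """Collect every type=... value from lines that come before the marker line."""
    found = []
    for line in lines:
        if marker in line:
            break  # stop reading at the marker line
        # findall keeps every type= on the line, not just the first one
        found.extend(re.findall(r"type=(\w+)", line))
    return found

# Hypothetical sample standing in for an open file object
sample = io.StringIO(
    "type=bye type=junk\nname=xyz type=getme\n  AUTO HEADER\ntype=dont_take_it\n"
)
print(collect_types(sample))  # ['bye', 'junk', 'getme']
```

Any iterable of lines works here, so an open file can be passed in place of the StringIO object.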

You can use a positive lookahead in the regex together with re.DOTALL:

op = re.findall(r"type=(\w+)(?=.*AUTO HEADER)", a, re.DOTALL)
print(op)

['bye', 'junk', 'getme', 'new']

(?=.*AUTO HEADER) is a positive lookahead: a match is accepted only if the text AUTO HEADER appears somewhere after it. This effectively excludes the unwanted matches that come after AUTO HEADER.

re.DOTALL lets . match newline characters as well, so the .* in the lookahead can scan across lines to find AUTO HEADER.
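
For reference, here is the same idea as a self-contained sketch, with the sample input inlined as a string instead of read from mod.txt. The variable names text, pattern and matches are chosen only for illustration.

```python
import re

# Sample input inlined for illustration; in the question it is read from mod.txt.
text = """bla bla bla
header
module a
  (
 type=bye
 type=junk
 name=xyz type=getme
 type=new
  AUTO HEADER

type=dont_take_it
type=junk
type=new
"""

# The lookahead succeeds only if "AUTO HEADER" occurs somewhere after the match.
# re.DOTALL lets "." match newlines, so ".*" can scan across lines.
pattern = re.compile(r"type=(\w+)(?=.*AUTO HEADER)", re.DOTALL)
matches = pattern.findall(text)
print(matches)  # ['bye', 'junk', 'getme', 'new']
```

Note that the lookahead only asks for AUTO HEADER somewhere later in the text, so if the marker occurs more than once, everything before its last occurrence is kept. The lookahead also rescans the rest of the text for every candidate match, which may be slow on large files.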

I don't think regex is the best option here, but here's how it could be done anyway.

You could do something like this:

[\s\S]*(?=AUTO HEADER)

Here \s matches any whitespace character (space, tab, line break, ...) and \S, its opposite, matches any character that is not whitespace. Together, [\s\S] matches any character at all, and * repeats that match zero or more times.

The (?=AUTO HEADER) is a positive lookahead: it requires AUTO HEADER to follow the main expression but does not include it in the result.
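
The pattern above only isolates the text before the marker; the type= values still have to be extracted from it. A rough two-step sketch follows. The helper name types_before_marker and the shortened sample are hypothetical, and it swaps in the lazy quantifier *? so that the match stops at the first AUTO HEADER, whereas the greedy * shown above would run to the last one.

```python
import re

def types_before_marker(text, marker="AUTO HEADER"):
    """Return all type=... values found before the first occurrence of marker."""
    # [\s\S]*? lazily consumes any characters, newlines included, and the
    # lookahead stops it just before the first occurrence of the marker.
    head = re.match(r"[\s\S]*?(?=" + re.escape(marker) + ")", text)
    if head is None:
        return []  # marker not found: assumed here to mean "take nothing"
    return re.findall(r"type=(\w+)", head.group(0))

# Hypothetical, shortened version of the question's input
sample = "type=bye\nname=xyz type=getme\n  AUTO HEADER\ntype=dont_take_it\n"
print(types_before_marker(sample))  # ['bye', 'getme']
```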

This may sound stupid, but have you considered passing the regex only the text up to your keyword instead of the full text? There is no reason not to just separate it off first, is there?
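
A minimal sketch of that suggestion, assuming the whole file fits in memory as one string. The helper name types_up_to and the shortened sample are invented for illustration. str.partition cuts the text at the first occurrence of the marker, and the plain regex from the question then runs on the first part only.

```python
import re

def types_up_to(text, marker="AUTO HEADER"):
    """Return all type=... values that appear before the first marker."""
    # partition returns (before, marker, after); if the marker is absent,
    # "before" is the whole text, so every match is kept.
    before, _, _ = text.partition(marker)
    return re.findall(r"type=(\w+)", before)

# Hypothetical, shortened version of the question's input
sample_text = "type=bye\ntype=junk\nname=xyz type=getme\n  AUTO HEADER\ntype=dont_take_it\n"
print(types_up_to(sample_text))  # ['bye', 'junk', 'getme']
```

With the question's script this would amount to calling types_up_to(a) on the string read from mod.txt.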
