How to find matching strings up to a specific string with regex in Python
I need to find specific strings in a file, up to the line AUTO HEADER. I am not sure how to restrict the regex so that it only finds matches up to a specific line. Can someone help me figure that out?
This is my script:
import re
a = open("mod.txt", "r").read()
op = re.findall(r"type=(\w+)", a, re.MULTILINE)
print(op)
This is my input file mod.txt:
bla bla bla
header
module a
(
type=bye
type=junk
name=xyz type=getme
type=new
AUTO HEADER
type=dont_take_it
type=junk
type=new
Output:
['bye', 'junk', 'getme', 'new', 'dont_take_it', 'junk', 'new']
Expected output:
['bye', 'junk', 'getme', 'new']
In the regex, I need to take AUTO HEADER into account, but I am not sure exactly how.
You can iterate over each line in the text file and stop as soon as you find the required key.
Example:
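import re

res = []
with open("mod.txt") as infile:
    for line in infile:
        # Stop reading as soon as the marker line is reached.
        if "AUTO HEADER" in line:
            break
        # Collect the value after "type=" if this line has one.
        op = re.search(r"type=(\w+)", line)
        if op:
            res.append(op.group(1))
print(res)  # --> ['bye', 'junk', 'getme', 'new']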
You can use a positive lookahead in the regex together with re.DOTALL:
op = re.findall(r"type=(\w+)(?=.*AUTO HEADER)", a, re.DOTALL)
print(op)
['bye', 'junk', 'getme', 'new']
(?=.*AUTO HEADER) is a positive lookahead. It ensures that every match is followed by the text AUTO HEADER somewhere later in the input, which effectively excludes the unwanted matches that come after AUTO HEADER.
re.DOTALL makes . match newlines as well, so the lookahead can search across lines and reach AUTO HEADER.
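One consequence of this lookahead is worth checking before relying on it: a match only needs some AUTO HEADER after it, so if the marker occurs more than once, matches between the occurrences are kept too. A minimal sketch, using a made-up input string with two marker lines (the input is an assumption, not taken from the question):

```python
import re

# Hypothetical input with two AUTO HEADER lines (not from the question).
text = "type=a\nAUTO HEADER\ntype=b\nAUTO HEADER\ntype=c\n"

# Same pattern as in the answer: keep a match if AUTO HEADER appears anywhere later.
any_later = re.findall(r"type=(\w+)(?=.*AUTO HEADER)", text, re.DOTALL)

# 'b' is kept because the second AUTO HEADER still follows it.
print(any_later)  # ['a', 'b']
```

With a single AUTO HEADER line, as in the question's mod.txt, this makes no difference.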
I don't think regex is the best option here, but here is how it could be done anyway.
You could do something like this:
[\s\S]*(?=AUTO HEADER)
Here \s matches any whitespace character (space, tab, line break, ...) and \S, its opposite, matches any non-whitespace character, so [\s\S] matches any character at all, including newlines. The * repeats that character class zero or more times, as many times as possible.
The (?=AUTO HEADER) part is a positive lookahead: it requires AUTO HEADER to follow the main expression but does not include it in the result. Because * is greedy, the match extends up to the last AUTO HEADER in the text if there is more than one.
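The pattern above only isolates the text before AUTO HEADER; the type= values still have to be extracted from that text. A possible two-step sketch follows, with the question's sample input inlined as a string for illustration. It uses the non-greedy form [\s\S]*? so that the match stops at the first AUTO HEADER, and the fallback for a missing marker is an assumption rather than part of the answer.

```python
import re

# Sample text inlined for illustration; in the question it is read from mod.txt.
text = """bla bla bla
header
module a
(
type=bye
type=junk
name=xyz type=getme
type=new
AUTO HEADER
type=dont_take_it
type=junk
type=new
"""

# Step 1: capture everything before AUTO HEADER.
# The non-greedy *? stops at the first occurrence of the marker.
prefix_match = re.match(r"[\s\S]*?(?=AUTO HEADER)", text)

# Assumption: if the marker is missing, search the whole text.
prefix = prefix_match.group(0) if prefix_match else text

# Step 2: run the original pattern on the prefix only.
op = re.findall(r"type=(\w+)", prefix)
print(op)  # ['bye', 'junk', 'getme', 'new']
```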
This may sound stupid, but have you considered not supplying the full text to your regex match, and only passing the text up to your keyword? There is no reason not to just separate it beforehand, is there?
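A minimal sketch of that idea, assuming the whole file fits in memory and that everything after the first AUTO HEADER should be ignored. The sample string stands in for the contents of mod.txt, and str.partition is one possible way to do the split (the answer does not name a specific method):

```python
import re

# Shortened stand-in for the contents of mod.txt.
text = "type=bye\ntype=junk\nname=xyz type=getme\ntype=new\nAUTO HEADER\ntype=dont_take_it\n"

# partition splits at the first occurrence of the separator.
# If the separator is absent, `before` is simply the whole text.
before, _marker, _after = text.partition("AUTO HEADER")

# Only the part before the marker is handed to the regex.
op = re.findall(r"type=(\w+)", before)
print(op)  # ['bye', 'junk', 'getme', 'new']
```

This keeps the regex itself unchanged from the question and moves the "stop here" logic into plain string handling.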