
Multi-line regex with multiple matches and negative conditions in Python

I am reading a text file and trying to capture one argument of every occurrence of a particular tag, skipping any occurrence that has been commented out.

More specifically, I have the following input:

maybe there is some text \THISTAG[arg1=1,argtwo]{WANT0}
% \THISTAG[arg1=1,argtwo]{NOTWANT}
% blah blah \THISTAG[arg1=1,argtwo]{NOTWANT}
\THISTAG[arg1=1,argtwo]{WANT1}\THISTAG[arg1=1,argtwo]{WANT2}\\stuff
\sometag{stuff I don't want}[{\THISTAG[arg1=1,argtwo]{WANT3}}]{more stuff I don't want}
\THISTAG[arg1=1,argtwo]{OBV_WANT}

I want the following output:

WANT0
WANT1
WANT2
WANT3
OBV_WANT

So far I have the following code, which doesn't accomplish what I want:

with open(target, "r") as ins:
    f = re.findall(r'^(?:[^%])?\\THISTAG\[.+\]{(.+?)}(?:{.+})?', ins.read(),re.MULTILINE)

You could apply the regex line by line and filter out the lines that start with %:

import re

# Raw string, so that \\ in the pattern matches a literal backslash.
pattern = re.compile(r'\\THISTAG\[.*?\]\{(.*?)\}')

res = []
with open('test.txt') as f:
    for line in f:
        # Skip lines that are commented out with a leading %.
        if not line.startswith('%'):
            res.extend(pattern.findall(line))

print(res)  # ['WANT0', 'WANT1', 'WANT2', 'WANT3', 'OBV_WANT']
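The line-by-line filter above can also be wrapped in a small function so it works on any iterable of lines. This is only a sketch: the name `extract_tag_args` is hypothetical, and the `lstrip()` call assumes that indented comment lines should be skipped too, which the question does not state.

```python
import re

# Matches \THISTAG[...]{...} and captures the text inside the braces.
TAG_RE = re.compile(r'\\THISTAG\[.*?\]\{(.*?)\}')

def extract_tag_args(lines):
    # Hypothetical helper: collects the braced argument of every tag
    # found on a line that is not commented out.
    found = []
    for line in lines:
        # Assumption: a comment line may be indented before its %.
        if line.lstrip().startswith('%'):
            continue
        found.extend(TAG_RE.findall(line))
    return found

sample = r"""maybe there is some text \THISTAG[arg1=1,argtwo]{WANT0}
% \THISTAG[arg1=1,argtwo]{NOTWANT}
% blah blah \THISTAG[arg1=1,argtwo]{NOTWANT}
\THISTAG[arg1=1,argtwo]{WANT1}\THISTAG[arg1=1,argtwo]{WANT2}\\stuff
\sometag{stuff I don't want}[{\THISTAG[arg1=1,argtwo]{WANT3}}]{more stuff I don't want}
\THISTAG[arg1=1,argtwo]{OBV_WANT}"""

print(extract_tag_args(sample.splitlines()))
# ['WANT0', 'WANT1', 'WANT2', 'WANT3', 'OBV_WANT']
```

Because the function takes lines rather than a file name, an open file object can be passed to it directly.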

Try this:

^%.*|\\THISTAG[^{]+{([^}]+)}


Explanation:
^ : Start of string, or start of line in multiline mode
. : Any character except a line break
* : Zero or more times
| : Alternation (OR)
\ : Escapes a special character, so \\ matches a literal backslash
[^x] : One character that is not x
+ : One or more times
( … ) : Capturing group

import re

# ^%.* consumes each comment line whole and leaves the group empty,
# so only tags on uncommented lines end up captured.
p = re.compile(r'^%.*|\\THISTAG[^{]+\{([^}]+)\}', re.MULTILINE)
test_str = r"""maybe there is some text \THISTAG[arg1=1,argtwo]{WANT0}
% \THISTAG[arg1=1,argtwo]{NOTWANT}
% blah blah \THISTAG[arg1=1,argtwo]{NOTWANT}
\THISTAG[arg1=1,argtwo]{WANT1}\THISTAG[arg1=1,argtwo]{WANT2}\\stuff
\sometag{stuff I don't want}[{\THISTAG[arg1=1,argtwo]{WANT3}}]{more stuff I don't want}
\THISTAG[arg1=1,argtwo]{OBV_WANT}"""

for m in p.findall(test_str):
    if m:  # skip the empty strings produced by comment lines
        print(m)

Output:

WANT0
WANT1
WANT2
WANT3
OBV_WANT
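The same pattern can be used with `re.finditer`, where a match that came from the comment branch has `group(1)` equal to `None` rather than an empty string. The following is a minimal sketch of that variant; the names `TAG_OR_COMMENT` and `wanted_args` are made up for illustration.

```python
import re

# First branch swallows a whole comment line; second branch captures
# the braced argument of a tag.
TAG_OR_COMMENT = re.compile(r'^%.*|\\THISTAG[^{]+\{([^}]+)\}', re.MULTILINE)

def wanted_args(text):
    # Hypothetical helper: keep only matches where the capturing group
    # took part, i.e. matches produced by the tag branch.
    return [m.group(1)
            for m in TAG_OR_COMMENT.finditer(text)
            if m.group(1) is not None]

sample = r"""maybe there is some text \THISTAG[arg1=1,argtwo]{WANT0}
% \THISTAG[arg1=1,argtwo]{NOTWANT}
\THISTAG[arg1=1,argtwo]{WANT1}\THISTAG[arg1=1,argtwo]{WANT2}\\stuff
\THISTAG[arg1=1,argtwo]{OBV_WANT}"""

print(wanted_args(sample))
# ['WANT0', 'WANT1', 'WANT2', 'OBV_WANT']
```

Testing for `None` makes the filtering step explicit about which branch of the alternation matched.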

So here's your regex shortened up a little bit:

re.findall(r'\\THISTAG\[.+?\]{([^N].+?)}', a,re.MULTILINE)

The important part is here:

{([^N].+?)}

The [^N] is where you need to make the distinction between what you want and what you don't want. Here it rejects any argument that begins with N (as NOTWANT does), so it filters on the argument's content rather than on the leading %. With the input you've given, I get this output:

>>> print(a)
\THISTAG[arg1=1,argtwo]{WANT0}
% \THISTAG[arg1=1,argtwo]{NOTWANT}
% blah blah \THISTAG[arg1=1,argtwo]{NOTWANT}
\THISTAG[arg1=1,argtwo]{WANT1}\THISTAG[arg1=1,argtwo]{WANT2}\stuff
\sometag{stuff I don't want}[{\THISTAG[arg1=1,argtwo]{WANT3}}]{more stuff I don't want}
\THISTAG[arg1=1,argtwo]{OBV_WANT}
>>>
>>> re.findall(r'\\THISTAG\[.+?\]{([^N].+?)}', a,re.MULTILINE)
['WANT0', 'WANT1', 'WANT2', 'WANT3', 'OBV_WANT']
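To see how a filter on the argument's first letter differs from a filter on the comment marker, the two patterns from this thread can be run on an input where the wanted argument happens to start with N. The input below is made up for illustration and is not from the question.

```python
import re

# Made-up input: the live tag's argument starts with N, and the
# commented-out tag's argument does not.
tricky = r"""\THISTAG[a]{NAME}
% \THISTAG[a]{OTHER}"""

# Filter on content: reject arguments whose first character is N.
content_based = re.findall(r'\\THISTAG\[.+?\]{([^N].+?)}', tricky, re.MULTILINE)

# Filter on the comment marker: consume % lines, then drop the empty captures.
comment_based = [
    g for g in re.findall(r'^%.*|\\THISTAG[^{]+\{([^}]+)\}', tricky, re.MULTILINE)
    if g
]

print(content_based)  # ['OTHER']
print(comment_based)  # ['NAME']
```

On the question's sample both approaches give the same list, because every commented-out argument there is NOTWANT.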
