

Python Regex Multiple Pattern Match and Anti Pattern Match needed

I have a Python string with the following pattern of data.

(a) be given by XXX for its election per scenario 3.3(a). (b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially

A string like the one above needs to be split at the markers (a) through (z), but not when a marker occurs in the middle of a statement. In particular, scenario references such as XX(a) through XX(z) must not be treated as split points.

I need it split into two lines:

  1. (a) be given by XXX for its election per scenario 3.3(a).

  2. (b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially

I am trying to pattern-match using Python's re module:

import re

patterns = ["[^0-9] (a) ", "[^0-9] (b) ", "[^0-9] (c) ", "[^0-9] (d) "]

textData = "(a) be given by XXX for its election per scenario 3.3(a). (b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially"
regexPattern = '|'.join(map(re.escape, patterns))
splitList = re.split(regexPattern, textData)
print(splitList)

This is the output I get when I run it:

['(a) be given by XXX for its election per scenario 3.3(a). (b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially']

The amount of whitespace before and after the '.' at the end of a section varies, and a new section, say (b) following (a), may begin on a new line.

Although your requirements are a bit fuzzy, a reasonable shot given your particular input string seems to be to split on any space that is preceded by a literal . and followed by the literal (letter) pattern.

import re

s = "(a) be given by XXX for its election per scenario 3.3(a). (b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially"

print(re.split(r"(?<=\.) (?=\([a-z]\))", s))

Output:

['(a) be given by XXX for its election per scenario 3.3(a).', 
 '(b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially']

I'd caution against using this on a large or complex input because the likelihood of false positives is high.
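The question also mentions that the whitespace around the period varies and that a new section may start on a new line. Here is a minimal sketch of a looser version of the same split, assuming Python 3.7+ (where re.split accepts a pattern that can match the empty string). The names SECTION_BREAK and split_sections are mine, purely for illustration, and the same false-positive caveat applies.

```python
import re

# Split point: right after a literal ".", across any run of whitespace
# (spaces, newlines, or nothing at all), just before a "(letter)" marker.
SECTION_BREAK = re.compile(r"(?<=\.)\s*(?=\([a-z]\))")


def split_sections(text):
    """Return the sections of text, stripped, dropping empty pieces."""
    return [part.strip() for part in SECTION_BREAK.split(text) if part.strip()]


if __name__ == "__main__":
    sample = (
        "(a) be given by XXX for its election per scenario 3.3(a) .\n"
        "(b) Second statement has a section 4.2(b) which might elect "
        "for scenario 2.4(a) potentially"
    )
    for section in split_sections(sample):
        print(section)
```

References such as 3.3(a) are left alone here only because the "." inside them is followed by a digit rather than by the marker itself; something like "3.(a)" would still be split.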


Another idea: if every letter of the alphabet is guaranteed to appear as a marker, eventually and in order, and you want to treat any out-of-order marker as ordinary content, you could try building a mammoth regex:

import re
from string import ascii_lowercase

s = "(a) be given by XXX for its election per scenario 3.3(a). (b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially. (c) blah blah  (c) blah blah (d) asd ad(a) (b) (e) ee (b) (a) (d) (f) (f) fff f ff (g) (a) gggg (h) hhhh (b) (i) iii i i (i) i (j) jjj (k) k (l) ll (a) (b) (x) (m) mm (n) nn (o) oo) () () (p) ppp (A) (B) (Q) (q) qq (r) rr (s) ss (t) tt( u ) (u) uu (v) vvv (ww) (w) ww (x) xx (y) yy (z) zzz"

# Raw f-string so that \( and \) reach the regex engine without invalid-escape warnings.
pattern = "".join(rf"((?: |^)\({letter}\) .+)" for letter in ascii_lowercase)

for result in re.findall(pattern, s)[0]:
    print(result.strip())

Output:

(a) be given by XXX for its election per scenario 3.3(a).
(b) Second statement has a section 4.2(b) which might elect for scenario 2.4(a) potentially. (c) blah blah
(c) blah blah
(d) asd ad(a) (b)
(e) ee (b) (a) (d) (f)
(f) fff f ff
(g) (a) gggg
(h) hhhh (b) (i) iii i i
(i) i
(j) jjj
(k) k
(l) ll (a) (b) (x)
(m) mm
(n) nn
(o) oo) () ()
(p) ppp (A) (B) (Q)
(q) qq
(r) rr
(s) ss
(t) tt( u )
(u) uu
(v) vvv (ww)
(w) ww
(x) xx
(y) yy
(z) zzz

This still makes some sweeping assumptions about the input, but might be worth playing around with; consider it a proof of concept.

Newlines are another issue to think about, if present (among many other things). Long story short, writing a parser by hand might be a better bet than regex.
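To give a rough idea of what a hand-written approach could look like, here is a small sketch. It scans for standalone (letter) markers and accepts one only if it is the next letter expected in alphabetical order. Everything in it is an assumption on my part: the names MARKER and split_by_ordered_markers are hypothetical, a marker must be surrounded by whitespace or the string boundaries, any text before the first (a) is dropped, and the first occurrence of each expected letter wins (unlike the greedy regex above, which prefers the last).

```python
import re
from string import ascii_lowercase

# A standalone marker: "(x)" not glued to other text on either side,
# so references such as "3.3(a)." are never considered.
MARKER = re.compile(r"(?<!\S)\(([a-z])\)(?!\S)")


def split_by_ordered_markers(text):
    """Split text at (a), (b), ... markers, accepted only in alphabetical order."""
    remaining = iter(ascii_lowercase)
    expected = next(remaining)
    starts = []
    for match in MARKER.finditer(text):
        if match.group(1) == expected:
            starts.append(match.start())
            expected = next(remaining, None)
            if expected is None:  # (z) has been consumed
                break
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    s = (
        "(a) be given by XXX for its election per scenario 3.3(a). "
        "(b) Second statement has a section 4.2(b) which might elect "
        "for scenario 2.4(a) potentially"
    )
    for section in split_by_ordered_markers(s):
        print(section)
```

Because the marker pattern treats a newline like any other whitespace, sections that start on a new line are handled the same way as those separated by spaces.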
