
Unexpected end of pattern: Python regex

When I use the following Python regex to perform the functionality described below, I get the error "Unexpected end of pattern".

Regex:

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

Purpose of this regex:

INPUT:

CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

Should match:

CODE876
CODE223
CODE657
CODE697

and replace occurrences with

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743

Should not match:

code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665

FINAL OUTPUT

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

EDIT and UPDATE 1

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

The error no longer occurs, but this does not match any of the patterns as needed. Is there a problem with the matching groups or with the matching itself? When I compile this regex as it stands, I get no match on my input.

EDIT AND UPDATE 2

import re

# Read the whole file into one string; the with-block closes it afterwards
with open("/Users/mymac/Desktop/regex.txt") as f:
    s = f.read()

s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)
print(s1)

INPUT

CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345

CODE234

CODE333

OUTPUT

<a href="http://productcode/CODE123">CODE123</a> <a href="http://productcode/CODE765">CODE765</a> testing1<a href="http://productcode/CODE123">CODE123</a> example1<a href="http://productcode/CODE345">CODE345</a> http://www.coding.com/<a href="http://productcode/CODE333">CODE333</a> <a href="http://productcode/CODE345">CODE345</a>

<a href="http://productcode/CODE234">CODE234</a>

<a href="http://productcode/CODE333">CODE333</a>

The regex works for raw input, but not for string input read from a text file.

See Inputs 4 and 5 for more results: http://ideone.com/3w1E3

Your main problem is the (?-i) thingy, which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.

import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a compile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
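
One way to settle observation 3 (compile time or run time) is to compile the pattern on its own, without calling re.sub at all. The sketch below assumes Python 3; the helper name `compiles` and the variable names are made up for illustration. The wording of the error message differs between Python versions, so only the exception type is checked. Note also that Python 3.11 and later reject a global flag such as (?i) that is not at the very start of the pattern, so the second pattern here drops the inline flags entirely.

```python
import re

# The pattern from the question, joined onto one line.
original = (r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)'
            r'(?-i)(CODE[0-9]{3})(?!</a>)')

# Same pattern with both inline flags removed (hypothetical variant).
without_inline_flags = (r'^((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)'
                        r'(CODE[0-9]{3})(?!</a>)')


def compiles(pattern, flags=0):
    """Return True if the pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern, flags)
        return True
    except re.error:
        return False


print(compiles(original))
print(compiles(without_inline_flags))
```

If the first call reports a failure, the problem is in the pattern itself and has nothing to do with the input or the replacement string.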

Looks like suggestions fall on deaf ears... Here's the pattern in re.VERBOSE format:

pattern4 = r'''
    ^
    (?i)
    (
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            . #### what is this for?
        )*?
    ) ##### end of capturing group 1
    (CODE[0-9]{3}) #### not in capturing group 1
    (?!</a>)
    '''
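
As a rough sketch of how such a layout is used, the pattern can be compiled with re.VERBOSE, which makes the regex engine ignore the whitespace and the `#` comments. Two things here are assumptions rather than part of the answer: the (?i) is dropped, so that lowercase `code876` stays unmatched as the question requires, and the name `rx_verbose` is made up.

```python
import re

rx_verbose = re.compile(r'''
    ^
    (                       # group 1: any prefix before the code
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            .               # one character that starts none of the above
        )*?
    )
    (CODE[0-9]{3})          # group 2: the product code itself
    (?!</a>)                # not directly followed by a closing anchor tag
    ''', re.VERBOSE)

m = rx_verbose.match('matchjustCODE657')
print(m.group(1), m.group(2))
print(rx_verbose.match('testing1CODE888'))
```

Laid out this way, it is easier to see that the product code lands in group 2, not group 1.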

Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how it works in most flavors. In Python, it seems they always modify the whole regex, the same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either, at least not in the Python 2.7 and 3.2 versions discussed here.
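
For readers on newer interpreters: Python 3.6 added scoped inline flags of the form (?i:...) and (?-i:...), which apply to just the enclosed part of the pattern. The following is a minimal sketch assuming Python 3.6 or later; the `sku-` prefix and the variable names are hypothetical and only serve to show the scoping.

```python
import re

# Only the "sku" prefix is case-insensitive; CODE must be upper case.
scoped_on = re.compile(r'(?i:sku)-(CODE[0-9]{3})')

# The reverse: re.IGNORECASE for the whole pattern, switched off for CODE.
scoped_off = re.compile(r'sku-(?-i:(CODE[0-9]{3}))', re.IGNORECASE)

for text in ['SKU-CODE123', 'sku-code123']:
    print(text, bool(scoped_on.search(text)), bool(scoped_off.search(text)))
```

A bare (?-i) without the colon and a body is still not accepted by the re module.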

On a side note, I don't see any reason to use three separate lookaheads, as you did here:

(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?

Just OR them together:

(?:(?!http://|testing[0-9]|example[0-9]).)*?
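
To check that the two spellings behave the same, both can be run over the sample lines from the question. This is only a sketch: the surrounding anchor and the CODE group are taken from the question's regex without the inline flags, and the names `separate`, `combined`, and `groups` are made up.

```python
import re

separate = re.compile(
    r'^((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})')
combined = re.compile(
    r'^((?:(?!http://|testing[0-9]|example[0-9]).)*?)(CODE[0-9]{3})')

samples = ['CODE876', 'CODE223', 'matchjustCODE657', 'CODE69743',
           'code876', 'testing1CODE888', 'example2CODE098',
           'http://replaced/CODE665']


def groups(rx, text):
    """Return the captured groups, or None when the line does not match."""
    m = rx.match(text)
    return m.groups() if m else None


for text in samples:
    print(text, groups(separate, text), groups(combined, text))
```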

EDIT: We seem to have moved from the question of why the regex throws exceptions to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.

s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)', 
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)

See it in action on ideone.com.

Is that what you're after?
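
For reference, here is roughly how that regex behaves on the sample lines from the question. Because the pattern is anchored with ^ and no re.MULTILINE flag is passed, this sketch applies it to each line separately; that per-line loop is an assumption about how the input is fed in, and the variable names are made up.

```python
import re

rx = re.compile(
    r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)')
replacement = r'\g<1><a href="http://productcode/\g<2>">\g<2></a>'

lines = ['CODE876', 'CODE223', 'matchjustCODE657', 'CODE69743',
         'code876', 'testing1CODE888', 'example2CODE098',
         'http://replaced/CODE665']

# Lines that do not match come back unchanged from sub().
result = [rx.sub(replacement, line) for line in lines]
for line in result:
    print(line)
```

The first four lines get a link around the three-digit code, and the last four are left alone.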


EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match our target strings.

(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))

The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.

Here's another demo. (I know that's horrible code, but I'm too tired right now to learn a more Pythonic way.)
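
Since the linked demo is not reproduced here, the following is a minimal sketch of what such a replacement function might look like. It is not the code from the demo; the function name `link_code` and the sample text are made up, and it relies on the stated assumption that full URLs only appear inside existing anchor elements.

```python
import re

rx = re.compile(
    r'(?s)(?:(<a\s+[^>]*>.*?</a>)'
    r'|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))')


def link_code(match):
    """Return existing anchors untouched; wrap bare product codes in a link."""
    if match.group(1) is not None:
        return match.group(1)
    prefix, code = match.group(2), match.group(3)
    return '{0}<a href="http://productcode/{1}">{1}</a>'.format(prefix, code)


text = ('CODE123 testing1CODE456 '
        '<a href="http://www.coding.com/CODE333">CODE333</a> matchjustCODE657')
linked = rx.sub(link_code, text)
print(linked)
```

Passing a callable to re.sub means the decision between "leave alone" and "wrap in a link" is made per match, which a static replacement string cannot do.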

The only problem I see is that you replace using the wrong capturing group.

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)  
                       ^                                                        ^                                                        ^
                    first capturing group                                  second one                                         using the first group

Here I made the first one a non-capturing group as well:

^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)

See it here on Regexr
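
The effect of the group numbering can be seen in a small experiment. This is a sketch, not the Regexr demo: both patterns below omit the (?i) and (?-i) flags so that they compile in Python, and the variable names are made up.

```python
import re

replacement = r'<a href="http://productcode/\g<1>">\g<1></a>'

# Prefix is capturing group 1, so \g<1> refers to the prefix, not the code.
capturing = re.compile(
    r'^((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)')

# Prefix is non-capturing, so the code becomes group 1.
non_capturing = re.compile(
    r'^(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)')

for text in ['CODE876', 'matchjustCODE657']:
    print(capturing.sub(replacement, text))
    print(non_capturing.sub(replacement, text))
```

One caveat with the non-capturing variant: the prefix is still part of the match, so it is replaced along with the code. If text such as `matchjust` has to survive, keep the prefix in a capturing group and refer to both groups in the replacement string.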

For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).
