What does error: nothing to repeat mean in this traceback from a compiled Python regex

I have run into an interesting issue while trying to understand and improve my use of regular expressions in Python.

Here is a regular expression:

verbose_signature_pattern_2 = re.compile("""
^            # begin match at new line
\t*          # 0-or-more tab
[ ]*         # 0-or-more blankspaces
S            # capital S
[iI][gG][nN][aA][Tt][uU][rR][eE]
[sS]?        # 0-or-1 S
\s*          # 0-or-more whitespace
[^0-9]       # anything but [0-9]
$            # newline character
""", re.VERBOSE|re.MULTILINE)

When I run the code I get an error:

""", re.VERBOSE|re.MULTILINE)
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat

If I get rid of the 0-or-more quantifier on the tab (\t) special character, it does not throw the error.

I am trying to find lines that have some variant of the word Signature as the first word on the line. I know I could use a slightly different approach and get what I need. However, I imagine that the creator of the document might tab over to approximately center the word, or they might use spaces. I do not want to use \s because I do not want to capture all of the empty lines that could precede the line that has the word Signature. Specifically, I am trying to avoid capturing all of this crud:

'\n\n\n\n            Signature    \n

I only want to see this in the output:

'            Signature    \n

I do realize I can easily strip off the excess newline characters, but I am trying to understand and do things more precisely. The interesting thing is that the following regex has the same start but seems to work as expected. That is, I do not get an error when this one compiles, and it seems to give me what I want, though I still need to find some more edge cases.

verbose_item_pattern_2 = re.compile(r"""
^            # begin match at newline
\t*          # 0-or-more tabs
[ ]*         # 0-or-more blanks
I            # a capital I
[tT][eE][mM] # one character from each of the three sets this allows for unknown case
\t*          # 0-or-more tabs
[ ]*         # 0-or-more blanks
\d{1,2}      # 1-or-2 digits
[.]?         # 0-or-1 literal .
\(?          # 0-or-1 literal open paren
[a-e]?       # 0-or-1 letter in the range a-e
\)?          # 0-or-1 closing paren
.*           # any number of unknown characters so we can have words and punctuation
[^0-9]       # anything but [0-9]
$            # 1 newline character
""", re.VERBOSE|re.MULTILINE)

The first string is not a raw string. So when Python compiles the string literal (before the pattern goes to the regex engine), it replaces all escape sequences. So \t actually becomes a tab character in the string (not backslash-t). But you are using free-spacing mode (re.VERBOSE), so whitespace is insignificant and that tab is ignored. Your regex is equivalent to:

^*[ ]*S[iI][gG][nN][aA][Tt][uU][rR][eE][sS]?\s*[^0-9]$

\s stays \s, even in a non-raw string, because it is not a recognized escape sequence in Python strings.

Then, right at the beginning, ^* causes the problem, because you cannot repeat an anchor.
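To see this in isolation, here is a minimal sketch that reduces the pattern to just its first two tokens. The variable names are made up for illustration. It assumes Python 3, where the exception is re.error rather than the Python 2.7 error shown in the traceback, and where the exact message wording may differ.

```python
import re

non_raw = "\t*"   # Python converts \t to one real tab character
raw = r"\t*"      # backslash and 't' both reach the regex engine

print(len(non_raw), len(raw))  # 2 3

# In verbose mode the real tab is skipped as whitespace, so the engine
# effectively sees "^*" and has nothing to repeat.
try:
    re.compile("^" + non_raw, re.VERBOSE)
    outcome = "compiled"
except re.error as exc:
    outcome = "error: %s" % exc

# With the raw string, the \t escape survives and "*" repeats the tab.
raw_ok = re.compile("^" + raw, re.VERBOSE)

print(outcome)
```

The only difference between the two attempts is the r prefix on the string literal.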

This is why you should always use raw strings to write regular expressions. Then \t just stays backslash-t, and the regex engine can interpret it as a tab.
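As a sketch of that fix, the pattern from the question is repeated below with only the r prefix added (plus some comment wording). The sample text is an assumption modeled on the example in the question, with 12 leading spaces chosen arbitrarily.

```python
import re

# Same pattern as in the question, now as a raw string.
verbose_signature_pattern_2 = re.compile(r"""
^            # begin match at the start of a line
\t*          # 0-or-more tabs
[ ]*         # 0-or-more spaces
S            # capital S
[iI][gG][nN][aA][Tt][uU][rR][eE]
[sS]?        # 0-or-1 S
\s*          # 0-or-more whitespace
[^0-9]       # anything but [0-9]
$            # end of the line
""", re.VERBOSE | re.MULTILINE)

# Hypothetical sample: blank lines, then an indented "Signature" line.
text = "\n\n\n\n" + " " * 12 + "Signature    \n"

matches = verbose_signature_pattern_2.findall(text)
print(matches)
```

On this sample the match starts at the indented line, so the leading blank lines are not captured.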

The space in [ ] is not a problem, by the way, since even in verbose/free-spacing mode, spaces inside character classes are significant.
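A small sketch of that difference, using two throwaway patterns that are not from the question:

```python
import re

# Outside a character class, verbose mode ignores the space.
bare = re.compile(r"a b", re.VERBOSE)
# Inside [ ], the space is still a literal space.
bracketed = re.compile(r"a[ ]b", re.VERBOSE)

print(bool(bare.fullmatch("ab")))        # True
print(bool(bare.fullmatch("a b")))       # False
print(bool(bracketed.fullmatch("a b")))  # True
print(bool(bracketed.fullmatch("ab")))   # False
```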
