
What does "error: nothing to repeat" mean in this traceback from a compiled Python regex?

I have run into an interesting issue while trying to understand and improve my use of regex in Python.

Here is a regular expression:

verbose_signature_pattern_2 = re.compile("""
^            # begin match at new line
\t*          # 0-or-more tab
[ ]*         # 0-or-more blankspaces
S            # capital S
[iI][gG][nN][aA][Tt][uU][rR][eE]
[sS]?        # 0-or-1 S
\s*          # 0-or-more whitespace
[^0-9]       # anything but [0-9]
$            # newline character
""", re.VERBOSE|re.MULTILINE)

When I run the code I get an error:

""", re.VERBOSE|re.MULTILINE)
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat

If I get rid of the 0-or-more quantifier on the tab (\t) special character, it does not throw the error.

I am trying to find lines that have some variant of the word Signature as the first word on the line. I know I could use a slightly different approach and get what I need. However, I imagine that the creator of the document might tab over to approximately center the word, or they might use spaces. I do not want to use \s because I do not want to capture all of the empty lines that could precede the line that has the word Signature. Specifically, I am trying to avoid capturing all of this crud:

'\n\n\n\n            Signature    \n

I only want to see this in the output

'            Signature    \n

I do realize I can easily strip off the excess newline characters, but I am trying to understand and do things more precisely. The interesting thing is that the following regex has the same start but seems to work as expected. That is, I do not get an error when it compiles, and it seems to give me what I want, though I still need to find some more edge cases.

verbose_item_pattern_2 = re.compile(r"""
^            # begin match at newline
\t*          # 0-or-more tabs
[ ]*         # 0-or-more blanks
I            # a capital I
[tT][eE][mM] # one character from each of the three sets this allows for unknown case
\t*          # 0-or-more tabs
[ ]*         # 0-or-more blanks
\d{1,2}      # 1-or-2 digits
[.]?         # 0-or-1 literal .
\(?          # 0-or-1 literal open paren
[a-e]?       # 0-or-1 letter in the range a-e
\)?          # 0-or-1 closing paren
.*           # any number of unknown characters so we can have words and punctuation
[^0-9]       # anything but [0-9]
$            # 1 newline character
""", re.VERBOSE|re.MULTILINE)

The first pattern is not a raw string, so when Python parses the string literal (before it reaches the regex engine) it replaces all recognized escape sequences. \t therefore becomes an actual tab character in the string (not backslash-t). But you are using free-spacing mode (re.VERBOSE), in which whitespace outside character classes is ignored, so that tab is simply dropped. Your regex is equivalent to:

^*[ ]*S[iI][gG][nN][aA][Tt][uU][rR][eE][sS]?\s*[^0-9]$

\s stays \s, even in a non-raw string, because it is not a recognized escape sequence in Python strings.
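A quick way to see the difference is to inspect the strings themselves before they ever reach re. This is only an illustrative sketch; the variable names are made up for the example.

```python
import re

# Compare what Python stores for a normal and a raw string literal.
plain_tab = "\t"    # escape processed by Python: one real tab character
raw_tab = r"\t"     # raw string: a backslash followed by the letter t

print(len(plain_tab), repr(plain_tab))
print(len(raw_tab), repr(raw_tab))

# The regex engine still understands the two-character form as "a tab".
tab_matches = re.fullmatch(raw_tab, "\t") is not None
print(tab_matches)
```

Recent Python 3 releases also emit a warning for unrecognized escapes such as \s in non-raw string literals, which is one more reason to prefer the raw form.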

The ^* right at the beginning is what causes the problem, because an anchor cannot be repeated.
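The failure can be reproduced with a much shorter pattern. The snippet below is a minimal sketch (broken_pattern is a hypothetical name, not from the question) that catches the exception instead of letting it propagate:

```python
import re

# Non-raw literal: Python converts \t to a real tab, and re.VERBOSE then
# discards that tab, so the engine effectively sees "^*S".
broken_pattern = "^\t*S"

try:
    re.compile(broken_pattern, re.VERBOSE | re.MULTILINE)
    message = None
except re.error as exc:
    message = str(exc)

print(message)
```

The message should contain "nothing to repeat"; newer Python versions may append the position of the offending quantifier.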

This is why you should always use raw strings to write regular expressions. Then \t stays as backslash-t and the regex engine interprets it as a tab.
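As a sketch of that fix, here is the pattern from the question with only the r prefix added, run against the sample string from the question. The names verbose_signature_pattern and sample are chosen for this example, and the comments are reworded slightly.

```python
import re

verbose_signature_pattern = re.compile(r"""
^            # start of a line (because of re.MULTILINE)
\t*          # 0-or-more tabs
[ ]*         # 0-or-more spaces
S            # capital S
[iI][gG][nN][aA][tT][uU][rR][eE]
[sS]?        # 0-or-1 S
\s*          # 0-or-more whitespace
[^0-9]       # one character that is not a digit
$            # end of a line
""", re.VERBOSE | re.MULTILINE)

sample = "\n\n\n\n            Signature    \n"
match = verbose_signature_pattern.search(sample)
print(repr(match.group(0)))
```

With this sample, the match starts at the indented Signature line and leaves out the blank lines before it.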

The space in [ ] is not a problem, by the way, since even in verbose/freespacing mode, spaces in character classes are significant.
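A small check of that behavior, assuming two throwaway patterns invented for the example:

```python
import re

# Bare spaces are ignored in verbose mode, so this pattern is just "ab".
bare_spaces = re.compile(r"a   b", re.VERBOSE)
# A space inside a character class is kept, so this pattern requires "a b".
class_space = re.compile(r"a [ ] b", re.VERBOSE)

ignored = bare_spaces.fullmatch("ab") is not None
kept = class_space.fullmatch("a b") is not None
rejected = class_space.fullmatch("ab") is None
print(ignored, kept, rejected)
```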


 