
Regular Expression and escape sequences

I have a file that contains a list of regular expressions to look for in a database.

One such pattern is (/|\)cmd\.com$ . But when I use it with the re module, it throws the error below. If I use the pattern (/|\\\\)cmd\.com$ instead, it works.

So the question is: when I read a pattern from a file into a variable (for example, a), how do I convert it to a regex pattern with four backslashes so that it works with the Python re module?

Also, how do we escape such escape sequences when a regex pattern is assigned to a variable, as with "a" below?

Any help on this is appreciated.

import re
a='(/|\)cmd\.com$'
re.compile(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

Thx, Santhosh

First note that your original regex is invalid. It should be (/|\\)cmd\.com$ . If such a string is coming from a database (or any source other than a string literal in your code), then no additional manipulation needs to be done before the regex engine sees it; the backslashes are already correct.
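To illustrate the point about patterns that do not come from string literals, here is a minimal sketch that reads a pattern from a file and compiles it unchanged. The helper name load_patterns and the one-pattern-per-line file layout are assumptions made for this example, not something from the question.

```python
import os
import re
import tempfile


def load_patterns(path):
    """Compile each non-empty line of the file as a regex, without modification."""
    with open(path) as handle:
        return [re.compile(line.rstrip("\n")) for line in handle if line.strip()]


# Create a throwaway pattern file whose single line is: (/|\\)cmd\.com$
# (a raw string is used here only so the file content is easy to read off).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(r"(/|\\)cmd\.com$" + "\n")
    patterns_path = tmp.name

patterns = load_patterns(patterns_path)
os.remove(patterns_path)

# The text read from the file is handed to re.compile as-is.
matches_backslash = patterns[0].search(r"C:\windows\cmd.com") is not None
matches_slash = patterns[0].search("/bin/cmd.com") is not None
matches_other = patterns[0].search("xcmd.com") is not None
```

Only the trailing newline is stripped; no backslashes are added or removed, because Python's string-literal escaping never applies to text read from a file.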

Full details and explanation:

Backslashes are special in that they escape other characters and give them different meanings.

a = '(/|\)cmd\.com$'

In this regular expression, the ) is special: it marks the end of a group. The backslash escapes it so that it is interpreted as a literal ) instead, which is not what you want (and is why you get the error about unbalanced parentheses).
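A small sketch that reproduces the failure, assuming a reasonably recent Python where re.error is the exception raised. The exact error message differs between versions (the traceback in the question is from Python 2.6), so the example only records that compilation fails.

```python
import re

# Same characters as the pattern in the question: ( / | \ ) c m d \ . c o m $
bad_pattern = r'(/|\)cmd\.com$'

try:
    re.compile(bad_pattern)
    compile_error = None
except re.error as exc:
    # The "\)" is read as a literal ")", so the opening "(" is never closed.
    compile_error = str(exc)
```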

You need to escape the backslash so that it is interpreted as a literal \ ; this can be done using yet another backslash:

a = '(/|\\)cmd\.com$'

However, even this will not work, since in Python there are two levels of processing going on (and thus two levels of escaping are needed). First, the string literal is evaluated, and the backslashes are interpreted string-wise: \. is not a meaningful escape and so evaluates to \. unchanged, whereas \\ evaluates to a single \ . Then, when the regex engine gets the string, it interprets any literal backslashes in that object regex-wise (for example, \. makes the . literal instead of "any character"). So you end up with:

a = '(/|\\\\)cmd\\.com$'    # Escaped version of (/|\\)cmd\.com$ which is what regex engine will see
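The two levels can be checked directly. This is just an illustrative sketch; the variable names are made up for the example.

```python
import re

escaped_literal = '(/|\\\\)cmd\\.com$'

# Level 1 (string literal): each "\\" in the source becomes one backslash,
# so the resulting str holds (/|\\)cmd\.com$ with three backslashes in total.
backslash_count = escaped_literal.count('\\')
string_length = len(escaped_literal)

# Level 2 (regex engine): "\\" matches one literal backslash, "\." one literal dot.
compiled = re.compile(escaped_literal)
match = compiled.search('dir\\cmd.com')   # the text: dir\cmd.com
matched_text = match.group(0)
separator = match.group(1)
```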

Because this problem is so common, Python has a way of writing strings such that the backslash is not treated specially in the string-processing stage: "raw" string literals:

a = r'(/|\\)cmd\.com$'    # backslashes here will be interpreted as literal \ characters

The regex engine will still interpret the backslashes in the string specially (a raw string is just a way of writing the literal; it still results in a plain str object).
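As a quick check of that last point, the raw literal and the doubly escaped literal produce the same str object contents, and the regex engine still treats \. as a literal dot. Again, only a sketch with illustrative names.

```python
import re

raw_form = r'(/|\\)cmd\.com$'
escaped_form = '(/|\\\\)cmd\\.com$'

# Both spellings produce the same plain str.
same_string = raw_form == escaped_form
is_plain_str = type(raw_form) is str

# The regex engine still interprets the backslashes: "\." only matches a real dot.
dot_is_literal = re.compile(raw_form).search('/cmdXcom') is None
```

One caveat with raw strings: a raw literal cannot end in a single backslash, so a pattern that must end that way needs the ordinary escaped form.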

In your example above, you need to make the regex pattern a Python "raw" string, like so:

  re.compile(r'put the pattern here')

If you post your code I might be able to help with your question about loading patterns from a file.
