I need to build an re pattern based on the unicode string (eg I have "word", and I need something like ^"word"| "word"). However the "word" can contain special re characters. To match the "word" as it is, I need to escape special re characters in unicode string. The basic re.escape() function does the job for ascii strings. How can I do this for unicode?
re.escape()
inserts a backslash before every character that's not an ASCII alphanumeric. This may in fact lead to a multitude of unnecessary backslashes to be inserted, however, Python ignores backslashes that don't start a recognized escape sequence, so there is no big harm done (except possibly some performance penalty).
But if you want to build a stricter escape()
, you can:
def escape(s):
return re.sub(r"[(){}\[\].*?|^$\\+-]", r"\\\g<0>", s)
which only touches the actual regex metacharacters. I sure hope I didn't miss any :)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.