Does Python re (regex) have an alternative to \u unicode escape sequences?

Question

Python treats \uxxxx as a unicode character escape inside a string literal (e.g. u"\u2014" is interpreted as the Unicode character U+2014). But I just discovered (Python 2.7) that the standard regex module doesn't treat \uxxxx as a unicode character. Example:

import re

codepoint = 2014 # Say I got this dynamically from somewhere

test = u"This string ends with \u2014"
pattern = r"\u%s$" % codepoint
assert(pattern[-5:] == "2014$") # Ends with an escape sequence for U+2014
assert(re.search(pattern, test) != None) # Failure -- No match (bad)
assert(re.search(pattern, "u2014")!= None) # Success -- This matches (bad)

Obviously, if you are able to specify your regex pattern as a string literal, then you can get the same effect as if the regex engine itself understood \uxxxx escapes:

test = u"This string ends with \u2014"
pattern = u"\u2014$"
assert(pattern[:-1] == u"\u2014") # Ends with actual unicode char U+2014
assert(re.search(pattern, test) != None)

But what if you need to construct your pattern dynamically?

Answer 1

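Use the unichr() function to create a unicode character from a code point. It takes an integer, so U+2014 has to be passed as 0x2014 (not the decimal 2014 used in the question):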

pattern = u"%s$" % unichr(codepoint)
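For anyone on Python 3, here is a rough sketch of the same idea. It assumes chr() in place of unichr(), and it adds re.escape() in case the character happens to be a regex metacharacter. The helper name pattern_ending_with is made up for illustration. If the code point arrives as hex digits in text form (like the "2014" in the question), int(text, 16) converts it first.

```python
import re

def pattern_ending_with(codepoint):
    """Build a regex that matches strings ending with the given code point.

    `codepoint` is assumed to be an integer such as 0x2014 (hypothetical helper).
    """
    # chr() is the Python 3 counterpart of Python 2's unichr().
    # re.escape() keeps the character literal even if it is a metacharacter.
    return re.escape(chr(codepoint)) + "$"

# The code point may arrive as hex digits in a string; convert it to an int.
codepoint = int("2014", 16)

pattern = pattern_ending_with(codepoint)
test = "This string ends with \u2014"

assert re.search(pattern, test) is not None   # matches the real character
assert re.search(pattern, "u2014") is None    # no longer matches the text "u2014"
```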

Answer 2

One possibility is, rather than calling re methods directly, to wrap them in something that can understand \u escapes on their behalf. Something like this:

import re

def my_re_search(pattern, s):
    return re.search(unicode_unescape(pattern), s)

def unicode_unescape(s):
    """
    Turn \uxxxx escapes into actual unicode characters
    """
    def unescape_one_match(matchObj):
        # Decode only the matched \uxxxx sequence, not the whole pattern
        escape_seq = matchObj.group(0)
        return escape_seq.decode('unicode_escape')
    return re.sub(r"\\u[0-9a-fA-F]{4}", unescape_one_match, s)

Example of it working:

>>> pat = r"C:\\.*\u20ac" # U+20ac is the euro sign
>>> print pat
C:\\.*\u20ac

>>> path = ur"C:\reports\twenty\u20acplan.txt"
>>> print path
C:\reports\twenty€plan.txt

# Underlying re.search method fails to find a match
>>> re.search(pat, path) != None
False

# Vs this:
>>> my_re_search(pat, path) != None
True
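In case it helps anyone on Python 3: str has no decode() method there, so the wrapper above would not run unchanged. Below is a minimal sketch of a port, assuming the hex digits are converted with int(..., 16) and chr() rather than the unicode_escape codec. The module-level name _U_ESCAPE is my own, not part of the original answer.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unicode_unescape(pattern):
    """Replace each \\uXXXX escape in `pattern` with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), pattern)

def my_re_search(pattern, s):
    """Call re.search after expanding \\uXXXX escapes in the pattern."""
    return re.search(unicode_unescape(pattern), s)

pat = r"C:\\.*\u20ac"                        # U+20ac is the euro sign
path = "C:\\reports\\twenty\u20acplan.txt"   # C:\reports\twenty€plan.txt

# Only the \u20ac escape is expanded; the escaped backslash is left alone.
assert unicode_unescape(pat) == "C:\\\\.*\u20ac"
assert my_re_search(pat, path) is not None
```

Note that the re module has understood \uXXXX escapes natively since Python 3.3, so a wrapper like this should only be needed on older versions.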

Thanks to "Process escape sequences in a string in Python" for pointing out the decode("unicode_escape") idea.

But note that you can't just throw your whole pattern through decode("unicode_escape"). It will work some of the time (because most regex special characters don't change their meaning when you put a backslash in front), but it won't work in general. For example, here using decode("unicode_escape") alters the meaning of the regex:

>>> pat = r"C:\\.*\u20ac" # U+20ac is the euro sign
>>> print pat
C:\\.*\u20ac # Asks for a literal backslash

>>> pat_revised = pat.decode("unicode_escape")
>>> print pat_revised
C:\.*€ # Asks for a literal period (without a backslash)
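The same pitfall can be checked against a real path. This is a small Python 3 sketch, so it assumes encode("ascii").decode("unicode_escape") as a stand-in for the Python 2 str.decode() call. It also relies on Python 3.3+ handling \uXXXX inside re patterns, which lets the unmodified pattern serve as the reference.

```python
import re

pat = r"C:\\.*\u20ac"                        # escaped backslash, any chars, euro sign
path = "C:\\reports\\twenty\u20acplan.txt"   # C:\reports\twenty€plan.txt

# Decoding the whole pattern also collapses the regex escape for a
# literal backslash into a single backslash, which then escapes the dot.
whole = pat.encode("ascii").decode("unicode_escape")

assert whole == "C:\\.*\u20ac"               # i.e. C:\.*€
assert re.search(whole, path) is None        # now demands literal periods after "C:"

# The untouched pattern still means what was intended (Python 3.3+).
assert re.search(pat, path) is not None
```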
