
Writing unicode regex for both Python2 and Python3

I can use ur'something' and the re.U flag in Python2 to compile a regex pattern, e.g.:

$ python2
Python 2.7.13 (default, Dec 18 2016, 07:03:39) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(ur'(«)', re.U)
>>> s = u'«abc «def«'
>>> re.sub(pattern, r' \1 ', s)
u' \xab abc  \xab def \xab '
>>> print re.sub(pattern, r' \1 ', s)
 « abc  « def « 

In Python3, I can avoid the u'something' and even the re.U flag:

$ python3
Python 3.5.2 (default, Oct 11 2016, 04:59:56) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(r'(«)')
>>> s = u'«abc «def«'
>>> print( re.sub(pattern, r' \1 ', s))
 « abc  « def « 

But the goal is to write the regex so that it supports both Python2 and Python3, and using ur'something' in Python3 results in a syntax error:

>>> pattern = re.compile(ur'(«)', re.U)
  File "<stdin>", line 1
    pattern = re.compile(ur'(«)', re.U)
                               ^
SyntaxError: invalid syntax

Since it's a syntax error, even checking the version before declaring the pattern doesn't work in Python3:

>>> import sys
>>> _pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
  File "<stdin>", line 1
    _pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
                                                             ^
SyntaxError: invalid syntax
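
The failure above happens at parse time, before the version check is ever evaluated. A minimal sketch of that point, using the built-in compile() on the same kind of expression (the names `source` and `parsed` are made up for this illustration, and the pattern is shortened to ASCII):

```python
import sys

# The whole statement is parsed before any of it runs, so the ur'' literal
# is rejected by Python 3 even though that branch would never be taken.
source = "_pattern = r'(x)' if sys.version_info[0] == 3 else ur'(x)'"

try:
    compile(source, "<example>", "exec")
    parsed = True
except SyntaxError:
    parsed = False

# Expected to be True on Python 2 and False on Python 3.
print(parsed)
```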

How can I write a unicode regex that supports both Python2 and Python3?


In this particular case, r' ' could easily be replaced by u' ' by dropping the raw string literal.

However, there are complicated regexes that more or less require r' ' for sanity's sake, e.g.:

re.sub(re.compile(r'([^\.])(\.)([\]\)}>"\'»]*)\s*$', re.U), r'\1 \2\3 ', s)

So the solution should keep the raw string r' ' usage unless there is another way around it. But do note that using unicode_literals from __future__ is undesirable, since it would cause tons of other problems, especially in other parts of the code base that I work with; see http://python-future.org/unicode_literals.html

The specific reason the code base discourages the unicode_literals import but uses the r' ' notation is that it is filled with raw strings, and changing each one of them would be extremely painful, e.g.

Do you really need raw strings? For your example, a unicode string is needed, but not a raw string. Raw strings are a convenience, not a requirement: just double every \ you would use in the raw string and use a plain unicode string.
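
As a rough sketch of that suggestion, here is the longer pattern from the question rewritten as a plain u'' literal with every backslash doubled. It assumes Python 2.7 or Python 3.3+ (where the u prefix is accepted again) and writes » as \u00bb so the source stays ASCII-only; the sample string `s` is made up for the illustration.

```python
import re

# Same pattern as r'([^\.])(\.)([\]\)}>"\'»]*)\s*$', but as a non-raw
# unicode literal: each regex backslash is doubled, » becomes \u00bb.
pattern = re.compile(u'([^\\.])(\\.)([\\]\\)}>"\'\u00bb]*)\\s*$', re.U)

s = u'abc.\u00bb'
result = re.sub(pattern, r'\1 \2\3 ', s)
print(result == u'abc .\u00bb ')
```

The doubled backslashes are harder to read than the raw form, which is the trade-off the question is trying to avoid.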

Python 2 allows concatenating a raw string with a unicode string (resulting in a unicode string), so you could use r'([^\.])(\.)([\]\)}>"\'' u'»' r']*)\s*$'
In Python 3, all three pieces are unicode strings, so that works too.
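
A minimal sketch of the concatenation approach, assuming Python 2.7 or Python 3.3+. The non-ASCII character is written as \u00bb here so the file needs no source-encoding declaration under Python 2; the sample string `s` is made up for the illustration.

```python
import re

# Adjacent literals are joined at compile time. The raw pieces keep their
# single backslashes; only the non-ASCII character sits in a u'' literal.
pattern = re.compile(
    r'([^\.])(\.)([\]\)}>"\''  # raw, ASCII-only part
    u'\u00bb'                  # the one non-ASCII character
    r']*)\s*$',                # raw, ASCII-only part
    re.U,
)

s = u'abc.\u00bb'
result = re.sub(pattern, r'\1 \2\3 ', s)
print(result == u'abc .\u00bb ')
```

This keeps the regex readable as a raw string and avoids both ur'' and unicode_literals.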


I had the same problem, and I ended up doing something like this using the dangerous eval() function. I know it's not pretty, but it allows my code to work in both Python 2 and Python 3.

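import re
import sys
if sys.version_info.major == 2:
    # The ur'' literal only exists inside the string, so Python 3 never parses it.
    pattern = eval("re.compile(ur'(\u00ab)', re.U)")
else:
    # Python 2 still parses this branch and needs a source-encoding declaration for «.
    pattern = re.compile(r'(«)', re.U)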
