Python regex module "re" match unicode characters with \u

I am trying to identify and replace Unicode characters in strings that I am processing, in order to build keyword match filters.

For example, given the string

"Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"

I want the output of the re.sub function (replacing the pattern with a blank space " ") to be

"Apple iPhone 12 mini A2176 128GB (PRODUCT) Red! Perfect condition! Unlocked!"

So I went to a regex build-and-test website and came up with this pattern

\\u[a-z|0-9]{4}

which captures the 2 unicode strings

\u00a0 and \u00a0

Now, trying to apply it in my Python code, I first tried this snippet. Here I use the findall function to see whether the code would return the unicode strings

import re

strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"


print(re.findall('\\u[a-z|0-9]{4}', strin))

which raises the following error

re.error: incomplete escape \u at position 0

I then tried adding an 'r' in front of the pattern string. No error appears, but no unicode string is returned

print(re.findall(r'\\u[a-z|0-9]{4}', strin))

The output is an empty list []. I then tried the same 2 approaches but with only 1 backslash

print(re.findall('\u[a-z|0-9]{4}', strin)) gives
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

print(re.findall(r'\u[a-z|0-9]{4}', strin)) gives 
re.error: incomplete escape \u at position 0

You have multiple misunderstandings here (all of which are common FAQs).

The argument to re.findall is a string. In ordinary Python string literals, backslashes have to be escaped by doubling them. A better solution is to use the r"..." raw string notation, especially for regular expressions, which often need to contain literal backslashes as part of the regex syntax itself.
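As a small sketch of that point (the variable names here are only illustrative), the snippet below shows that the doubled-backslash literal and the raw literal from the question are the very same pattern string, which is why both attempts fail with the same regex error:

```python
import re

# In a normal literal, '\\u' is two characters: a backslash and "u".
normal = '\\u[a-z|0-9]{4}'
# In a raw literal, backslashes are kept as written, so r'\u' is the same two characters.
raw = r'\u[a-z|0-9]{4}'

print(normal == raw)  # True
print(len(r'\\u'))    # 3 (backslash, backslash, "u")

# Both spellings hand the regex engine an incomplete \u escape.
try:
    re.compile(raw)
    compile_error = None
except re.error as exc:
    compile_error = str(exc)
print(compile_error)
```

The raw-string attempt with two backslashes, by contrast, compiles fine but looks for a literal backslash in the subject string, and the subject string does not contain one.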

The error message you get from findall tells you that the character escape \u[ is incorrect because [ is not a hexadecimal digit. (In fact, even if your regex weren't syntactically incorrect, it would match far too much; the regex for a Unicode character escape in Python would be \\u[0-9a-f]{4}, not a-z.)
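To illustrate when a pattern for the escape text itself is useful, here is a minimal sketch. It assumes a hypothetical input, escaped_text, that really does contain a backslash followed by "u" and four hex digits (for instance text that was never decoded); the A-F in the class is my addition to cover uppercase digits.

```python
import re

# A raw literal keeps the backslash, so this string contains the six
# characters \ u 0 0 a 0 (hypothetical, not-yet-decoded input).
escaped_text = r"128GB\u00a0(PRODUCT)"

# \\ matches one literal backslash; the class matches four hex digits.
escape_pattern = re.compile(r'\\u[0-9a-fA-F]{4}')

found = escape_pattern.findall(escaped_text)
print(found)  # ['\\u00a0']

# In a normal literal the escape is already decoded to one character,
# so the same pattern finds nothing.
decoded_text = "128GB\u00a0(PRODUCT)"
print(escape_pattern.findall(decoded_text))  # []
```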

The \u00a0 in your string literal is a single Unicode character (U+00A0, a no-break space); Python decodes the escape when it parses the literal, so the string contains one character there, not six. You can't match it with a regex like that. What you can match is e.g.

re.findall(r'[\u0080-\uffef]', strin)

which contains a character class covering the range of non-ASCII characters in the Unicode Basic Multilingual Plane. (This includes surrogates, which properly speaking we should exclude, but let's not go there for a beginner question. Maybe also note that there are Unicode characters outside the BMP, which can be matched with [\U00010000-\U0010FFFF].)
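A short sketch of what this looks like on the string from the question; the sample string used for the non-BMP case is made up for illustration:

```python
import re

strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"

# The escape is decoded when the literal is parsed: one character, not six.
print(len("\u00a0"))  # 1

# In a raw pattern, the re module itself interprets \uXXXX inside the class.
non_ascii = re.findall(r'[\u0080-\uffef]', strin)
print(non_ascii)  # ['\xa0', '\xa0']

# Characters outside the BMP need the eight-digit \U form.
astral = re.findall(r'[\U00010000-\U0010FFFF]', "ok \U0001F600")
print(len(astral))  # 1
```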

(Tangentially, notice also that the character class [a-z|0-9] includes the literal character | in the character class. The | stands for alternation outside character classes, but inside [ ... ] almost everything except an initial ^ and a - between two characters is just a literal character.)
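A quick way to see this for yourself (the test string is arbitrary):

```python
import re

# Inside [...] the | is an ordinary character and is matched literally.
in_class = re.findall(r'[a-z|0-9]', 'a|9 Z')
print(in_class)  # ['a', '|', '9']

# Outside a class, | means alternation: match "a" or "9".
alternation = re.findall(r'a|9', 'a|9 Z')
print(alternation)  # ['a', '9']
```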

But more fundamentally, the beginner reaction to "I don't understand this Unicode stuff" is wrong; the response should be "I need to understand this stuff", not "I need to remove it". There is rarely a good case for simply removing all Unicode, and that tendency only drags you back into the dark ages before Unicode, when you could only represent English text (and barely that) on Western computers.

A more principled solution to this specific problem is to canonicalize all whitespace characters (perhaps except tabs) to an ASCII space, and figure out how to tackle other Unicode characters as you bump into them. What makes sense depends hugely on your specific application. For search or NLP, it might make sense to canonicalize or "flatten" all text to a near-ASCII subset, but for many other applications, you usually need something a bit more nuanced.

With that out of the way, try

Python 3.8.2 (default, May 18 2021, 11:47:11) 
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
>>> import re
>>> re.sub(r'\s', ' ', strin)
'Apple iPhone 12 mini A2176 128GB (PRODUCT) Red! Perfect condition! Unlocked!'
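Note that \s also matches tabs and newlines. If you want to keep tabs, one possible variation is sketched below; normalize_spaces is a hypothetical helper name, and the double-negative character class is just one way to express "whitespace except tab":

```python
import re

def normalize_spaces(text):
    """Replace every whitespace character except a tab with an ASCII space."""
    # [^\S\t] reads as "not non-whitespace and not a tab",
    # i.e. any whitespace character other than a tab.
    return re.sub(r'[^\S\t]', ' ', text)

sample = "128GB\u00a0(PRODUCT)\tRed!\u2003Perfect"
print(repr(normalize_spaces(sample)))  # '128GB (PRODUCT)\tRed! Perfect'
```

Newlines are whitespace too, so this version flattens them to spaces as well; exclude \n in the class if line structure matters for your filters.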

If your purpose is just to remove non-ASCII characters from your text, then you are working way too hard. You can do it simply with

strin.encode('ascii', 'ignore').decode('ascii')

You encode your string as ASCII and ignore the errors, then decode it again as ASCII, thus removing all the non-ASCII characters.
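One caveat worth checking before relying on this: the characters are dropped, not replaced with a space, so on the string from the question the neighbouring words end up glued together. A small demonstration (the second input is a made-up example):

```python
strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"

# 'ignore' silently discards every character that ASCII cannot encode.
ascii_only = strin.encode('ascii', 'ignore').decode('ascii')
print(ascii_only)
# Apple iPhone 12 mini A2176 128GB(PRODUCT) Red!Perfect condition! Unlocked!

# Accented letters disappear entirely as well.
stripped_word = "caf\u00e9".encode('ascii', 'ignore').decode('ascii')
print(stripped_word)  # caf
```

Whether that is acceptable depends on the keyword filters; for the output requested in the question, replacing whitespace with a space gets closer.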
