Python Regex - Matching mixed Unicode and ASCII characters in a string

I've tried several different approaches and none of them work.

Suppose I have a string s defined as follows:

s = '[မန္း],[aa]'.decode('utf-8')

Suppose I want to parse the two strings within the square brackets. I've compiled the following regex:

pattern = re.compile(r'\[(\w+)\]', re.UNICODE)

and then I look for occurrences using:

pattern.findall(s, re.UNICODE)

The result is just [] instead of the expected list of two matches. Furthermore, if I remove the re.UNICODE from the findall call I get the single string [u'aa'], i.e. the non-Unicode one:

pattern.findall(s)

Of course

s = '[bb],[aa]'.decode('utf-8')
pattern.findall(s)

returns [u'bb', u'aa']

And to make things even more interesting:

s = '[မနbb],[aa]'.decode('utf-8')
pattern.findall(s)

returns [u'\u1019\u1014bb', u'aa']

It's actually rather simple. \w matches only alphanumeric characters (and the underscore), and not all of the characters in your initial string are alphanumeric.
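To see which characters are responsible, each one can be checked individually. The sketch below is written for Python 3 (where str is already Unicode and \w is Unicode-aware by default), not the Python 2 of the question. The code points are written as escapes so the example does not depend on the file's encoding, and the variable names are only illustrative.

```python
import re
import unicodedata

# The text inside the first pair of brackets, written with escapes.
inner = '\u1019\u1014\u1039\u1038'

# For each character: code point, Unicode general category,
# and whether \w matches it on its own.
report = [
    ('U+%04X' % ord(ch), unicodedata.category(ch), re.fullmatch(r'\w', ch) is not None)
    for ch in inner
]
for row in report:
    print(row)

# \w+ stops at the first combining mark, so the closing bracket is never
# reached and only the second bracketed group is found.
s = '[' + inner + '],[aa]'
matches = re.findall(r'\[(\w+)\]', s)
print(matches)
```

The first two characters are letters, while the last two are combining marks (their category starts with M), which \w does not match.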

If you still want to match all characters between the brackets, one solution is to match everything but a closing bracket ( ] ). This can be done as follows:

import re
s = '[မန္း],[aa]'.decode('utf-8')
pattern = re.compile(r'\[([^]]+)\]', re.UNICODE)
re.findall(pattern, s)

where [^]] is a negated character class: it matches any character except those listed after the circumflex ( ^ ), which here is only the closing bracket ( ] ).

Also, note that the re.UNICODE argument to re.compile is not necessary here: the flag changes the meaning of shorthand classes such as \w, \d, \s and \b, and this pattern uses none of them.
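The effect of the flag can be seen by comparing patterns with and without it. This is a sketch for Python 3, where Unicode matching is the default for str patterns and re.ASCII switches it off, so re.ASCII stands in here for omitting re.UNICODE in Python 2. The sample string is one of those from the question, written with escapes.

```python
import re

s = '[\u1019\u1014bb],[aa]'   # '[မနbb],[aa]' written with escapes

# \w depends on the flag: under re.ASCII it only matches [a-zA-Z0-9_].
word_unicode = re.findall(r'\[(\w+)\]', s)
word_ascii = re.findall(r'\[(\w+)\]', s, re.ASCII)

# [^]] is defined by an explicit character, so the flag changes nothing.
class_unicode = re.findall(r'\[([^]]+)\]', s)
class_ascii = re.findall(r'\[([^]]+)\]', s, re.ASCII)

print(word_unicode, word_ascii)
print(class_unicode == class_ascii)
```

Only the \w-based pattern gives different results between the two modes.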

First, note that the following only works in Python 2.x if you've saved the source file in UTF-8 encoding and you declare the source code encoding at the top of the file; otherwise, the default encoding of the source is assumed to be ascii:

#coding: utf8
s = '[မန္း],[aa]'.decode('utf-8')

A shorter way to write it is to use a Unicode string literal directly:

#coding: utf8
s = u'[မန္း],[aa]'
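For comparison, here is a sketch of the same distinction in Python 3, where source files are UTF-8 by default and every str literal is already Unicode. The byte string is built with encode purely for illustration; it stands in for what a plain Python 2 str literal would hold.

```python
text = '[\u1019\u1014\u1039\u1038],[aa]'   # Unicode text
raw = text.encode('utf-8')                 # the equivalent UTF-8 bytes

print(len(text))   # counts characters
print(len(raw))    # counts bytes; each Myanmar character takes 3 bytes in UTF-8

# Decoding the bytes gives back the original text.
round_trip = raw.decode('utf-8')
print(round_trip == text)
```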

Next, \w matches alphanumeric characters. With the re.UNICODE flag it matches characters that are categorized as alphanumeric in the Unicode database. Not all of the characters in မန္း are alphanumeric. If you want whatever is between the brackets, use something like the following. Note the use of .*? for a non-greedy match of everything. It's also a good habit to use Unicode strings for all text, and raw strings in particular for regular expressions.

#coding:utf8
import re
s = u'[မန္း],[aa],[မနbb]'
pattern = re.compile(ur'\[(.*?)\]')
print re.findall(pattern,s)

Output:

[u'\u1019\u1014\u1039\u1038', u'aa', u'\u1019\u1014bb']

Note that when printing a list, Python 2 displays the repr of each string: an unambiguous form with escape codes for non-ASCII and non-printable characters.

To see the actual string content, print the strings, not the list:

for item in re.findall(pattern,s):
    print item

Output:

မန္း
aa
မနbb
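
One further detail may explain the empty list in the question: for a compiled pattern, the second positional argument of findall is a start position (pos), not a flags value. re.UNICODE has the integer value 32, so the search would start past the end of an 11-character string and find nothing. A minimal sketch in Python 3, using one of the strings from the question written with escapes:

```python
import re

s = '[\u1019\u1014bb],[aa]'          # '[မနbb],[aa]', 11 characters
pattern = re.compile(r'\[(\w+)\]')   # flags belong here, in re.compile

all_matches = pattern.findall(s)               # search the whole string
from_pos_6 = pattern.findall(s, 6)             # start at index 6, after the first group
flag_as_pos = pattern.findall(s, re.UNICODE)   # re.UNICODE is taken as pos=32

print(int(re.UNICODE), len(s))
print(all_matches, from_pos_6, flag_as_pos)
```

Flags are passed once to re.compile; the compiled pattern's methods then take only the string and optional start and end positions.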
