Python - regex - special characters and ñ
I have this script to test a regex and how unicode behaves:
# -*- coding: utf-8 -*-
import re
p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
w = re.findall('[a-zA-ZÑñ]+',p.decode('utf-8'), re.UNICODE)
print(w)
And the print statement is showing this:
[u'Solo', u'voy', u'si', u'se', u'sucedier', u'n', u'o', u'se', u'suceden', u'ma', u'ana', u'los', u'siguien', u'es', u'eventos']
"sucedierón" is being transformed to "u'sucedier', u'n'", and similarly "mañana" becomes "u'ma', u'ana'".
I have tried decoding, and adding '\xc3\xb1a' to the regex for 'Ñ'.
Later, after reading some docs, I realized that [a-zA-Z] only matches ASCII characters. That is why I changed to r'\b\w+\b', so I can add flags to the regex:
w = re.findall(r'\b\w+\b', p, re.UNICODE)
But this didn't work.
I also tried to decode() first and findall() later:
p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
U = p.decode('utf8')
If I print variable U:
"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
I see that the output is as expected, but when I use findall() again:
[u'Solo', u'voy', u'si', u'se', u'sucedier\xf3n', u'o', u'se', u'suceden', u'ma\xf1ana', u'los', u'siguien\xf1es', u'eventos']
Now the word is complete, but ó is replaced with \xf3 and ñ is replaced with \xf1, their unicode values.
How can I findall() and get the non-ASCII characters "ñ", "á", "é", "í", "ó", "ú"?
I know there are a lot of questions of this kind on SO, and believe me I read a lot of them, but I just cannot find the missing part.
EDIT
I am using Python 2.7.
EDIT 2: Can someone else try what @LetzerWille suggests? It is not working for me.
The re.UNICODE flag allows you to use word characters \w and word boundaries \b with diacritics (accents and tildes). This is extremely useful to match words in different languages.
Code:
# -*- coding: utf-8 -*-
# http://stackoverflow.com/q/32872917/5290909
#python 2.7.9
import re
text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
# Decode to unicode
unicode_text = text.decode('utf8')
matches = re.findall(ur'\b\w+\b', unicode_text, re.UNICODE)
# Encode back again to UTF-8
utf8_matches = [ match.encode('utf-8') for match in matches ]
# Print every word
for utf8_word in utf8_matches:
    print utf8_word
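The reason for printing word by word is that printing a whole list in Python 2 shows the repr() of each element, so u'ma\xf1ana' and u'mañana' are the same string displayed two ways. A minimal sketch of that point, written so it should run on both Python 2.7 and Python 3.3+ (the unicode literal and the variable names are my own choices here, not part of the question):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# A unicode literal stands in for text.decode('utf8') so the sketch is portable.
text = u"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
words = re.findall(r'\b\w+\b', text, re.UNICODE)

# The escaped and the literal spellings are the very same string.
assert u'ma\xf1ana' == u'mañana'
assert words[8] == u'mañana'

# Printing each element shows the characters themselves,
# provided the console encoding can represent them.
for word in words:
    print(word)
```

So the \xf1 in the list output is only how the list is displayed; the matched strings already contain ñ.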
Your code should be written as:
w = re.findall(u'[a-zA-ZÑñ]+', p.decode('utf-8'))
Please add other characters into the character class on your own, since I don't know the full set of characters you want to match.
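As an illustration only, here is what an extended class might look like if the target is Spanish text. The set of extra letters (accented vowels, ü and ñ) and the name SPANISH_WORD are assumptions of mine; adjust them to your data. A unicode literal replaces p.decode('utf-8') so the sketch runs on Python 2.7 and Python 3.3+ alike:

```python
# -*- coding: utf-8 -*-
import re

p = u"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

# Hypothetical extended class: ASCII letters plus Spanish accented vowels, ü and ñ.
# Both the pattern and the input are unicode objects.
SPANISH_WORD = re.compile(u'[a-zA-ZÁÉÍÓÚÜÑáéíóúüñ]+')

w = SPANISH_WORD.findall(p)
```

With this class, "sucedierón" and "mañana" are no longer split at the accented letters.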
When you are processing Unicode text, make sure that both the input string and the pattern are of unicode type [1].
[1] unicode is logically an array of UTF-16 code units (in narrow builds) or UTF-32 code units/code points (in wide builds). If you intend to process Unicode text with Python, to avoid the issue with astral plane characters in narrow builds, I recommend using Python 3.3 and above, or always using a wide build for other versions.
In Python 2, str is simply an array of bytes, so characters outside the ASCII range in the pattern will simply be interpreted as the sequence of bytes making up that character in the source encoding:
>>> [i for i in '[a-zA-ZÑñ]+']
['[', 'a', '-', 'z', 'A', '-', 'Z', '\xc3', '\x91', '\xc3', '\xb1', ']', '+']
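The same difference can be checked by encoding the unicode pattern by hand. This is a small sketch assuming the source file is saved as UTF-8; it should behave identically on Python 2.7 and Python 3.3+:

```python
# -*- coding: utf-8 -*-
pattern = u'[a-zA-ZÑñ]+'
encoded = pattern.encode('utf-8')

# As unicode, each letter is one element of the pattern.
assert len(pattern) == 11
# As UTF-8 bytes, Ñ and ñ take two bytes each.
assert len(encoded) == 13

assert u'Ñ'.encode('utf-8') == b'\xc3\x91'
assert u'ñ'.encode('utf-8') == b'\xc3\xb1'

# Code points of the two letters, as seen by the regex engine in unicode mode.
code_points = (ord(u'Ñ'), ord(u'ñ'))
```

In the byte-string pattern the class therefore contains four separate bytes rather than two letters, which is why it never matches the decoded ñ in the input.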
Compare the output of re.DEBUG when compiling the str and unicode objects:
>>> re.compile('[a-zA-ZÑñ]+', re.DEBUG)
max_repeat 1 4294967295
  in
    range (97, 122)
    range (65, 90)
    literal 195 # \xc3
    literal 145 # \x91
    literal 195
    literal 177
<_sre.SRE_Pattern object at 0x6fffffd0dd8>
>>> re.compile(u'[a-zA-ZÑñ]+', re.DEBUG)
max_repeat 1 4294967295
  in
    range (97, 122)
    range (65, 90)
    literal 209 # Ñ
    literal 241 # ñ
<_sre.SRE_Pattern object at 0x6ffffded030>
Since you are not using \s, \w, \d, the re.UNICODE flag has no effect and can be removed.
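A quick way to convince yourself, sketched with a shortened sample string of my own (the variable names are hypothetical). It should run on Python 2.7 and Python 3.3+; note that in Python 3, str patterns are Unicode-aware by default, so the flag is redundant there:

```python
# -*- coding: utf-8 -*-
import re

text = u"mañana sucedierón"

# Explicit character class: the flag changes nothing.
plain = re.findall(u'[a-zA-ZÑñó]+', text)
flagged = re.findall(u'[a-zA-ZÑñó]+', text, re.UNICODE)
assert plain == flagged

# \w is where re.UNICODE matters (on Python 2): accented letters count as word characters.
words = re.findall(r'\w+', text, re.UNICODE)
```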
It works for me. I use PyCharm and I have set the console to UTF-8.
You need to configure your output console to UTF-8:
p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
w = re.findall('ñ',p, re.UNICODE)
print(w)
['ñ', 'ñ']
w = re.findall('[a-zA-ZÑñó:]+',p, re.UNICODE)
print(w)
['Solo', 'voy', 'si', 'se', 'sucedierón', 'o', 'se', 'suceden', 'mañana', 'los', 'siguienñes', 'eventos:']