
Python - regex - special characters and ñ

I have this script to test a regex and how unicode behaves:

# -*- coding: utf-8 -*-
import re

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

w = re.findall('[a-zA-ZÑñ]+',p.decode('utf-8'), re.UNICODE)

print(w)

And the print statement shows this:

[u'Solo', u'voy', u'si', u'se', u'sucedier', u'n', u'o', u'se', u'suceden', u'ma', u'ana', u'los', u'siguien', u'es', u'eventos']

"sucedierón" is being split into "u'sucedier', u'n'", and similarly "mañana" becomes "u'ma', u'ana'".

I have tried decoding, and adding '\xc3\xb1a' to the regex for 'Ñ'.

Later, after reading some docs, I realized that [a-zA-Z] only matches ASCII characters. That is why I changed to r'\b\w+\b', so that I can add flags to the regex:

w = re.findall(r'\b\w+\b', p, re.UNICODE) 

But this didn't work.

I also tried to decode() first and findall() later:

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
U = p.decode('utf8')

If I print the variable U:

"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

I see that the output is as expected, but when I use findall() again:

[u'Solo', u'voy', u'si', u'se', u'sucedier\xf3n', u'o', u'se', u'suceden', u'ma\xf1ana', u'los', u'siguien\xf1es', u'eventos']

Now the words are complete, but ó is replaced with \xf3 and ñ is replaced with \xf1, their Unicode escape values.

How can I use findall() and get the non-ASCII characters "ñ", "á", "é", "í", "ó", "ú"?

I know there are a lot of questions of this kind on SO, and believe me, I have read a lot of them, but I just cannot find the missing part.

EDIT

I am using Python 2.7.

EDIT 2: Can someone else try what @LetzerWille suggests? It is not working for me.

Regex with accented characters (diacritics) in Python

The re.UNICODE flag allows you to use word characters \w and word boundaries \b with diacritics (accents and tildes). This is extremely useful to match words in different languages.

  1. Decode your text from UTF-8 to unicode.
  2. Make sure the pattern and the subject text are passed as unicode to the regex functions.
  3. The result is a list of unicode strings that can be looped/mapped to encode back again to UTF-8.
  4. Printing the list shows non-ASCII characters escaped, but it's safe to print each string independently.
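
Point 4 can be illustrated with a small sketch. It assumes Python 3, where ascii() gives roughly the escaped form that Python 2.7 shows when a list of unicode strings is printed (minus the u prefix); the variable names are only illustrative.

```python
# Assumes Python 3; ascii() is used here to mimic the escaped repr
# that Python 2.7 shows for unicode strings inside a list.
words = ["sucedierón", "mañana"]

# The container's escaped form: non-ASCII characters appear as \x.. escapes.
escaped = ascii(words)

# Joining (or printing each string on its own) keeps the real characters.
joined = ", ".join(words)

print(escaped)
```

The escapes are only a display form of the list; the strings themselves still contain ó and ñ.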

Code:

# -*- coding: utf-8 -*-
# http://stackoverflow.com/q/32872917/5290909
#python 2.7.9

import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
# Decode to unicode
unicode_text = text.decode('utf8')

matches = re.findall(ur'\b\w+\b', unicode_text, re.UNICODE)

# Encode back again to UTF-8
utf8_matches = [ match.encode('utf-8') for match in matches ]

# Print every word
for utf8_word in utf8_matches:
    print utf8_word

ideone demo
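
For comparison, here is a minimal sketch of the same idea under Python 3 (an assumption; the question targets 2.7). There str is already Unicode and \w is Unicode-aware by default, so no decode step or re.UNICODE flag is needed, while re.ASCII brings back the ASCII-only behavior.

```python
import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

# Default for str patterns in Python 3: \w and \b are Unicode-aware.
unicode_words = re.findall(r"\b\w+\b", text)

# re.ASCII limits \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\b\w+\b", text, re.ASCII)

print(len(unicode_words), len(ascii_words))
```

The second list is longer because "sucedierón", "mañana" and "siguienñes" are each cut in two at the non-ASCII letter.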

Your code should be written as:

w = re.findall(u'[a-zA-ZÑñ]+', p.decode('utf-8'))

Please add other characters to the character class yourself, since I don't know the full set of characters you want to match.
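
As one possible way to extend the class, here is a sketch that adds the accented vowels listed in the question plus ü. The exact set is an assumption, so adjust it to your data. It is written for Python 3, where string literals are already Unicode; in Python 2.7 the pattern needs the u prefix and the text needs decoding first.

```python
import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

# Assumed set of Spanish letters beyond ASCII; extend as required.
SPANISH_LETTERS = "ÁÉÍÓÚÜÑáéíóúüñ"
pattern = re.compile("[a-zA-Z" + SPANISH_LETTERS + "]+")

words = pattern.findall(text)
print(len(words))
```

Note that the literals in the class are precomposed characters, so text stored in a decomposed form would need normalizing before matching.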

When you are processing Unicode text, make sure that both the input string and the pattern are of unicode [1] type.

[1] unicode is logically an array of UTF-16 code units (in a narrow build) or UTF-32 code units/code points (in a wide build). If you intend to process Unicode text with Python, then to avoid the issue with astral plane characters in narrow builds, I recommend using Python 3.3 or above, or always using a wide build for other versions.
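
To make the footnote concrete, here is a small sketch, assuming Python 3.3 or later, where a string's length counts code points. The same character takes two UTF-16 code units, which is what a narrow Python 2 build would expose as two elements.

```python
# Assumes Python 3.3+, where str is indexed by code point.
astral = "\U0001F600"  # a character outside the Basic Multilingual Plane

code_points = len(astral)
# Each UTF-16 code unit is 2 bytes; utf-16-le avoids the byte order mark.
utf16_units = len(astral.encode("utf-16-le")) // 2

print(code_points, utf16_units)
```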

In Python 2, str is simply an array of bytes, so characters outside the ASCII range in the pattern will simply be interpreted as the sequence of bytes making up that character in the source encoding:

>>> [i for i in '[a-zA-ZÑñ]+']
['[', 'a', '-', 'z', 'A', '-', 'Z', '\xc3', '\x91', '\xc3', '\xb1', ']', '+']  

Compare the output of re.DEBUG when compiling the str and the unicode object:

>>> re.compile('[a-zA-ZÑñ]+', re.DEBUG)
max_repeat 1 4294967295
  in
    range (97, 122)
    range (65, 90)
    literal 195      # \xc3
    literal 145      # \x91
    literal 195
    literal 177
<_sre.SRE_Pattern object at 0x6fffffd0dd8>

>>> re.compile(u'[a-zA-ZÑñ]+', re.DEBUG)
max_repeat 1 4294967295
  in
    range (97, 122)
    range (65, 90)
    literal 209      # Ñ
    literal 241      # ñ
<_sre.SRE_Pattern object at 0x6ffffded030>
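
The effect of those byte literals can be reproduced in Python 3 (an assumption made so the sketch runs on a current interpreter). Decoding the pattern's UTF-8 bytes as Latin-1 turns each byte into the code point of the same value, which roughly mirrors how Python 2 treats a str pattern matched against unicode text.

```python
import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

# Emulate the Python 2 str pattern: one character per UTF-8 byte.
byte_wise_pattern = "[a-zA-ZÑñ]+".encode("utf-8").decode("latin-1")
# The class now holds U+00C3, U+0091, U+00C3, U+00B1 instead of Ñ and ñ.
class_members = [ord(c) for c in byte_wise_pattern[7:-2]]

# Byte-wise class: ñ and ó are not members, so words are split.
broken = re.findall(byte_wise_pattern, text)
# Proper Unicode class: ñ matches; ó still splits because it is not listed.
fixed = re.findall("[a-zA-ZÑñ]+", text)

print(class_members)
```

The broken list matches the output shown in the question, with 'ma' and 'ana' as separate items.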

Since you are not using \s, \w, or \d, the re.UNICODE flag has no effect and can be removed.
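
A quick way to check that claim, sketched for Python 3 (an assumption): there re.UNICODE is already the default for str patterns, so the comparison uses re.ASCII instead. That flag changes shorthand classes such as \w, \d and \s, but not literal characters listed in a class.

```python
import re

text = "mañana los siguienñes eventos"
explicit_class = "[a-zA-ZÑñ]+"

# Literal characters in a class match the same with or without the flag.
default_result = re.findall(explicit_class, text)
ascii_flag_result = re.findall(explicit_class, text, re.ASCII)

# The shorthand \w, by contrast, does depend on the flag.
shorthand_differs = re.findall(r"\w+", text) != re.findall(r"\w+", text, re.ASCII)

print(default_result == ascii_flag_result, shorthand_differs)
```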

It works for me. I use PyCharm and I have set the console to UTF-8.

You need to configure your output console to UTF-8:

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

w = re.findall('ñ',p, re.UNICODE)

print(w)

['ñ', 'ñ']

w = re.findall('[a-zA-ZÑñó:]+',p, re.UNICODE)

print(w)

['Solo', 'voy', 'si', 'se', 'sucedierón', 'o', 'se', 'suceden', 'mañana', 'los', 'siguienñes', 'eventos:']
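
If changing the console encoding is not an option, one hedged workaround is to escape whatever the console cannot represent instead of letting the print fail. The helper name console_safe is hypothetical, and the sketch assumes Python 3.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks escaped.

    console_safe is an illustrative helper, not part of any library.
    """
    # Fall back to UTF-8 when stdout reports no encoding (e.g. some pipes).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

print(console_safe("mañana", "ascii"))
```

With a UTF-8 console the text passes through unchanged; with an ASCII-only one, ñ is shown as an escape rather than raising an encoding error.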
