
Python + Regex + UTF-8 doesn't recognize accents

My problem is that Python, using regex and re.search(), doesn't recognize accented characters even though I declare utf-8. Here is my code:

#! /usr/bin/python
# -*- coding: utf-8 -*-
import re

htmlString = '</dd><dt> Fine, thank you.&#160;</dt><dd> Molt bé, gràcies.'

SearchStr = '(\<\/dd\>\<dt\>)+ ([\w+\,\.\s]+)([\&\#\d\;]+)(\<\/dt\>\<dd\>)+ (\w+) (\w+)'

Result = re.search(SearchStr, htmlString)

if Result:
    print Result.groups()

passavol23:jO$ catalanword.py
('</dd><dt>', 'Fine, thank you.', '&#160;', '</dt><dd>', 'Molt', 'b')

So the problem is that it doesn't recognize the é and therefore stops matching there. Any help would be appreciated. I'm a Python beginner.

By default, \w only matches ASCII characters; it is equivalent to [a-zA-Z0-9_]. Matching UTF-8 bytes with regular expressions is hard enough, let alone matching only word characters: you'd have to match byte ranges instead.
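To make the byte-level problem concrete, here is a minimal sketch. It assumes Python 3, where the undecoded Python 2 string from the question corresponds to a bytes object; the variable names are illustrative only.

```python
import re

text = "Molt bé, gràcies."
raw = text.encode("utf-8")  # stand-in for the undecoded Python 2 byte string

# In UTF-8, "é" is two bytes, and neither is an ASCII word character.
print("é".encode("utf-8"))        # b'\xc3\xa9'

# A bytes pattern uses ASCII-only \w, so every word is cut at its accent.
print(re.findall(rb"\w+", raw))   # [b'Molt', b'b', b'gr', b'cies']

# A str (Unicode) pattern treats accented letters as word characters.
print(re.findall(r"\w+", text))   # ['Molt', 'bé', 'gràcies']
```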

You'll need to decode from UTF-8 to unicode and use the re.UNICODE flag instead:

>>> re.search(SearchStr, htmlString.decode('utf8'), re.UNICODE).groups()
(u'</dd><dt>', u'Fine, thank you.', u'&#160;', u'</dt><dd>', u'Molt', u'b\xe9')
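For readers on Python 3, a rough equivalent of the fix might look like the sketch below. It assumes that str patterns are Unicode-aware by default there, so re.UNICODE does not need to be passed; the pattern keeps the structure of the question's but drops the redundant backslashes, and the variable names are illustrative only.

```python
import re

# Same structure as the question's pattern, without the unneeded escapes.
PATTERN = r"(</dd><dt>)+ ([\w+,.\s]+)([&#\d;]+)(</dt><dd>)+ (\w+) (\w+)"

html_text = "</dd><dt> Fine, thank you.&#160;</dt><dd> Molt bé, gràcies."
html_bytes = html_text.encode("utf-8")  # stand-in for the undecoded input

# Matching the raw bytes reproduces the question's result: the last group is b'b'.
bytes_groups = re.search(PATTERN.encode("ascii"), html_bytes).groups()

# Decoding first lets \w see "é" as a single word character.
text_groups = re.search(PATTERN, html_bytes.decode("utf-8")).groups()

print(bytes_groups[-1])  # b'b'
print(text_groups[-1])   # bé
```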

However, you should really be using an HTML parser to deal with HTML instead. Use BeautifulSoup, for example. It'll handle encoding and Unicode correctly for you.
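As a rough illustration of the parser approach, the sketch below uses the standard library's html.parser rather than BeautifulSoup, purely so that it runs without third-party packages; it assumes Python 3. The class DefinitionListText and the function extract_terms are hypothetical helpers written for this example, not part of any library.

```python
from html.parser import HTMLParser


class DefinitionListText(HTMLParser):
    """Collect the text inside each <dt> and <dd> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &#160; into characters.
        super().__init__(convert_charrefs=True)
        self.items = []          # [tag, text] pairs, in document order
        self._collecting = False

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self.items.append([tag, ""])
            self._collecting = True

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._collecting = False

    def handle_data(self, data):
        if self._collecting:
            self.items[-1][1] += data


def extract_terms(html):
    parser = DefinitionListText()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    # strip() also removes the non-breaking space produced by &#160;
    return [(tag, text.strip()) for tag, text in parser.items]


html_string = "</dd><dt> Fine, thank you.&#160;</dt><dd> Molt bé, gràcies."
print(extract_terms(html_string))
```

With the text already extracted as Unicode strings, no tag-matching regular expression is needed, and a plain \w+ can split the words if required.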

