Why can't this Python 3 code remove Unicode accented characters with str.translate()?

Question

I am trying to normalize the accented characters in a string in Python 3, like this:

from bs4 import BeautifulSoup
import os

def process_markup():
    #the file is utf-8 encoded
    fn = os.path.join(os.path.dirname(__file__), 'src.txt')
    markup = BeautifulSoup(open(fn), from_encoding="utf-8")

    for player in markup.find_all("div", class_="glossary-player"):
        text = player.span.string
        print(format_filename(text)) # Python console shows mangled characters not in utf-8
        player.span.string.replace_with(format_filename(text))

    dest = open("dest.txt", "w", encoding="utf-8")
    dest.write(str(markup))

def format_filename(s):
    # prepare string
    s = s.strip().lower().replace(" ", "-").strip("'")

    # transliterate accented characters to non-accented versions
    chars_in = "àèìòùáéíóú"
    chars_out = "aeiouaeiou"
    no_accented_chars = str.maketrans(chars_in, chars_out)
    return s.translate(no_accented_chars)

process_markup()

The input file src.txt is utf-8 encoded:

<div class="glossary-player">
    <span class="gd"> Fàilte </span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
    <span class="gd"> àèìòùáéíóú </span><span class="en"> aeiouaeiou </span>
</div>

The output file dest.txt looks like this:

ï»¿<div class="glossary-player">
<span class="gd">fã ilte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">ã ã¨ã¬ã²ã¹ã¡ã©ãã³ãº</span><span class="en"> aeiouaeiou </span>
</div>

I am trying to get it to look like this:

<div class="glossary-player">
<span class="gd">failte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">aeiouaeiou</span><span class="en"> aeiouaeiou </span>
</div>

I know there are solutions like unidecode, but I just want to know what I am doing wrong here.

Answer 1

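chars.translate(no_accented_chars) does not modify chars. It returns a new string with the translation applied. If you want to use the translated string, save it to a variable (possibly the original chars variable):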

chars = chars.translate(no_accented_chars)

or pass it directly to the write call:

dest.write(chars.translate(no_accented_chars))
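
As a minimal sketch of the point above (the variable names original and translated are only for illustration, and the translation table is the one from the question), the following shows that str.translate() leaves its input untouched and hands back a new string:

```python
# Build the same translation table as in the question.
no_accented_chars = str.maketrans("àèìòùáéíóú", "aeiouaeiou")

original = "fàilte"

# translate() returns a new string; strings are immutable, so
# 'original' cannot be changed in place.
translated = original.translate(no_accented_chars)

print(original)    # still contains the accented character
print(translated)  # accent-free copy
```

Note that format_filename() in the question already returns the result of translate(), so this only matters where the return value is discarded.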

Answer 2

I strongly suspect that your HTML file contains something like

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

This basically forces BeautifulSoup to reinterpret the UTF-8 as ISO-8859-1 (or whatever legacy character set you have there; Windows-1252? Shudder).

There are plenty of other places in a chunk of HTML where you could add a charset= attribute, but this would be the typical culprit.
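
To illustrate why a wrong charset defeats the translation table, here is a small sketch that skips BeautifulSoup entirely and simply assumes the UTF-8 bytes get decoded as ISO-8859-1. Under that assumption every accented letter turns into two unrelated characters, so the table has nothing to match:

```python
# Assumption: the file holds UTF-8 bytes but is decoded as ISO-8859-1.
text = "Fàilte"
raw = text.encode("utf-8")        # b'F\xc3\xa0ilte'

wrong = raw.decode("iso-8859-1")  # each byte becomes its own character
right = raw.decode("utf-8")       # bytes decoded as intended

table = str.maketrans("àèìòùáéíóú", "aeiouaeiou")

# The misdecoded text contains no characters from the table,
# so translate() leaves the mojibake as it is.
print(repr(wrong.lower().translate(table)))
print(repr(right.lower().translate(table)))
```

The first result has the same shape as the mangled "gd" span in dest.txt from the question, which is consistent with a decoding problem rather than a problem in str.translate() itself.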

Answer 3

The problem, as triplee suggested, is that the file was being interpreted with the wrong encoding.

The data in the file is correct (as a hex dump shows), but, probably because of the missing charset declaration, Python was reading it as cp1252 rather than as utf-8.

To fix this, the encoding has to be declared explicitly when opening the file with Python's open() function, so the line:

markup = BeautifulSoup(open(fn), from_encoding="utf-8")

needs to be changed to:

markup = BeautifulSoup(open(fn, encoding="utf-8"))
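
The effect of the encoding argument can be checked without BeautifulSoup. The sketch below is only an illustration under stated assumptions: it writes a throwaway UTF-8 file (a stand-in for src.txt), reads it back once as utf-8 and once as cp1252, and runs both results through the format_filename() function from the question:

```python
import os
import tempfile

def format_filename(s):
    # Same steps as in the question: tidy the string, then strip accents.
    s = s.strip().lower().replace(" ", "-").strip("'")
    no_accented_chars = str.maketrans("àèìòùáéíóú", "aeiouaeiou")
    return s.translate(no_accented_chars)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "src.txt")  # hypothetical sample file

    with open(path, "w", encoding="utf-8") as f:
        f.write(" Fàilte \n")

    # Explicit encoding: the text is decoded correctly.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

    # Simulates a platform whose default encoding is cp1252.
    with open(path, encoding="cp1252") as f:
        misread = f.read()

print(repr(format_filename(decoded)))
print(repr(format_filename(misread)))
```

One detail that may be relevant here: the ï»¿ at the start of dest.txt is what a UTF-8 byte order mark looks like when read as cp1252. If the source file does start with a BOM, opening it with encoding="utf-8-sig" instead of "utf-8" would strip it on read.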

3 answers

Answer 1: score 3, 2014-06-07 12:13:17

Answer 2: score 1, 2014-06-07 16:56:04

Answer 3: score 0, accepted, 2014-06-07 17:12:06