
Why does this Python 3 code fail to remove Unicode accented characters using str.translate()?

I am trying to normalise accented characters in a string in Python 3 like this:

from bs4 import BeautifulSoup
import os

def process_markup():
    #the file is utf-8 encoded
    fn = os.path.join(os.path.dirname(__file__), 'src.txt') #
    markup = BeautifulSoup(open(fn), from_encoding="utf-8")

    for player in markup.find_all("div", class_="glossary-player"):
        text = player.span.string
        print(format_filename(text)) # Python console shows mangled characters not in utf-8
        player.span.string.replace_with(format_filename(text))

    dest = open("dest.txt", "w", encoding="utf-8")
    dest.write(str(markup))

def format_filename(s):
    # prepare string
    s = s.strip().lower().replace(" ", "-").strip("'")

    # transliterate accented characters to non-accented versions
    chars_in = "àèìòùáéíóú"
    chars_out = "aeiouaeiou"
    no_accented_chars = str.maketrans(chars_in, chars_out)
    return s.translate(no_accented_chars)

process_markup()

The input src.txt file is utf-8 encoded:

<div class="glossary-player">
    <span class="gd"> Fàilte </span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
    <span class="gd"> àèìòùáéíóú </span><span class="en"> aeiouaeiou </span>
</div>

The output file dest.txt looks like this:

<div class="glossary-player">
<span class="gd">fã ilte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">ã ã¨ã¬ã²ã¹ã¡ã©ã­ã³ãº</span><span class="en"> aeiouaeiou </span>
</div>

and I am trying to get it to look like this:

<div class="glossary-player">
<span class="gd">failte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">aeiouaeiou</span><span class="en"> aeiouaeiou </span>
</div>

I know there are solutions like unidecode, but I just want to find out what I'm doing wrong here.

chars.translate(no_accented_chars) doesn't modify chars. Strings are immutable, so it returns a new string with the translation applied. If you want to use the translated string, assign it to a variable (perhaps back to the original chars variable):

chars = chars.translate(no_accented_chars)

or pass it directly to the write call:

dest.write(chars.translate(no_accented_chars))
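As a minimal sketch of that point, the snippet below calls translate() twice on a made-up sample value (the string assigned to chars is an assumption for illustration; the translation table is the one from the question). Only the call whose return value is kept has any visible effect.

```python
# Translation table from the question: accented vowel -> plain vowel.
no_accented_chars = str.maketrans("àèìòùáéíóú", "aeiouaeiou")

chars = "fàilte"  # hypothetical sample value, not taken from the question

# The return value is discarded here, so chars is left exactly as it was.
chars.translate(no_accented_chars)
unchanged = chars

# Keeping the returned string is what actually gives the translated text.
translated = chars.translate(no_accented_chars)

print(unchanged, translated)
```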

I strongly suspect that your HTML file contains something like

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

which basically forces BeautifulSoup to reinterpret the UTF-8 as ISO-8859-1 (or whichever legacy charset you have there; Windows-1252? Shudder).

There are a number of other places where a charset= attribute can appear in a block of HTML, but this would be the typical culprit.

The problem was that, as triplee suggested, the file was being interpreted with the wrong encoding.

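The data in the file was correct (as shown by a hex dump), but because open() was called without an encoding argument, Python read it using the platform's default encoding (cp1252 here) rather than utf-8. The from_encoding argument did not help, since BeautifulSoup was being handed already-decoded text rather than bytes.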
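The mangled output can be reproduced without any file at all. The sketch below assumes the wrong encoding was cp1252, as described above, and reuses the steps of format_filename() from the question; the names raw, wrong, right, mangled and clean are introduced here for illustration only.

```python
def format_filename(s):
    # Same steps as the function in the question.
    s = s.strip().lower().replace(" ", "-").strip("'")
    table = str.maketrans("àèìòùáéíóú", "aeiouaeiou")
    return s.translate(table)

# The bytes as they are stored in a UTF-8 file.
raw = " Fàilte ".encode("utf-8")

wrong = raw.decode("cp1252")  # decoding with the wrong (legacy) encoding
right = raw.decode("utf-8")   # decoding with the encoding the file really uses

mangled = format_filename(wrong)
clean = format_filename(right)

print(repr(mangled), repr(clean))
```

In UTF-8, "à" is stored as the two bytes C3 A0. Decoded as cp1252 those become "Ã" followed by a no-break space, and lower() then turns "Ã" into "ã". Neither character is in the translation table, and the no-break space is not matched by replace(" ", "-"), which is consistent with the "fã ilte" seen in dest.txt.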

To fix this, it was necessary to state the encoding explicitly when opening the file with Python's open() function, so the line:

markup = BeautifulSoup(open(fn), from_encoding="utf-8")

needed to be changed to:

markup = BeautifulSoup(open(fn, encoding="utf-8"))
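To see the effect of the encoding argument in isolation, here is a small sketch that leaves BeautifulSoup out and uses only the standard library. The temporary file stands in for src.txt and its content is a made-up fragment; passing encoding="cp1252" is an assumption used to mimic a Windows default on any platform.

```python
import os
import tempfile

# Write a small UTF-8 file standing in for src.txt (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write('<span class="gd"> Fàilte </span>'.encode("utf-8"))
    path = tmp.name

try:
    # Explicit encoding: the file is decoded the same way on every platform.
    with open(path, encoding="utf-8") as f:
        explicit = f.read()

    # Forcing cp1252 mimics what a cp1252 platform default would do.
    with open(path, encoding="cp1252") as f:
        legacy = f.read()
finally:
    os.remove(path)

print("Fàilte" in explicit, "Fàilte" in legacy)
```

Using with blocks also closes the files promptly, which the original open(fn) call and the dest.txt handle in the question do not do.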
