Python lxml & string encoding issues

Question

I'm using lxml to extract text from html docs and I cannot get some characters in the text to render properly. It's probably a stupid thing, but I can't seem to figure out a solution...

Here's a simplified version of the html:

<html>
    <head>
        <meta content="text/html" charset="UTF-8"/>
    </head>
    <body>
        <p>DAÑA – bis'e</p> <!---that's an N dash and the single quote is curly--->
    </body>
</html>

A simplified version of the code:

import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)

The output in my terminal has those little boxes with a character error (for example, one where it should show "– E"), but if I copy-paste from there to here, it looks like:

>>> DAÃO bisâe

If I do a simple echo + problem characters in the terminal they render properly, so I don't think that's the problem.

The html encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried lxml.etree.tostring() with a utf-8 declaration, but then .iter() doesn't work ('bytes' object has no attribute 'iter'), or, if I put it towards the end nodes in the code, .text doesn't work ('bytes' object has no attribute 'text').

Any ideas what's going wrong and/or how to solve it?

Answer 1

Open the file with the correct encoding (I'm assuming UTF-8 here; look at the HTML file to confirm).

import lxml.html as LH

with open("path/to/file", encoding="utf8") as f:
    tree = LH.parse(f)
    root = tree.getroot()
    for para in root.iter("p"):
        print(para.text)

Background explanation of how you arrived where you currently are.

Incoming data from the server:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 91 4F         UTF-8        DAÑO                   proper decode
44 41 C3 91 4F         Latin-1      DAÃ▯O                  improper decode

The bytes should not have been decoded as Latin-1; that's an error.

C3 91 represents one character in UTF-8 (the Ñ), but it's two characters in Latin-1 (the Ã, and byte 91). Byte 91 has no printable character in Latin-1 (it falls in the C1 control range), so there is nothing to display. I used ▯ to make it visible. A text editor might skip it altogether, showing DAÃO instead, or show a weird box or an error marker.
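
The two rows of the table can be reproduced in a few lines of Python. This is only a minimal sketch: the variable names are illustrative and the bytes are typed in by hand rather than fetched from a server.

```python
# The five bytes from the table above, written as hex.
raw = bytes.fromhex("44 41 C3 91 4F")

# Proper decode: C3 91 is read as one character, Ñ (U+00D1).
good = raw.decode("utf-8")

# Improper decode: every byte becomes its own character, so C3 turns
# into Ã (U+00C3) and 91 turns into the control character U+0091.
bad = raw.decode("latin-1")

print(good == "DA\u00d1O")          # True
print(len(good), len(bad))          # 4 5
print([hex(ord(c)) for c in bad])   # ['0x44', '0x41', '0xc3', '0x91', '0x4f']
```

The length difference (4 characters versus 5) is the quickest sign that the wrong codec was used.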

When writing the improperly decoded string to file:

String                 Encoded as   Result Bytes (hex)     Comment
DAÃ▯O                  UTF-8        44 41 C3 83 C2 91 4F   weird box preserved as C2 91

The string should not have been encoded as UTF-8 at this point; that's an error, too.

The Ã got converted to C3 83, which is correct for this character in UTF-8. Note how the byte sequence now matches what you told me in the comments (\xc3\x83\xc2\x91).

When reading that file:
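
The second mistake can be checked the same way. The sketch below assumes the mangled string from the table above, spelled with escape sequences so the invisible control character is explicit; the names `mangled`, `on_disk` and `hex_view` are illustrative.

```python
# The improperly decoded string: D, A, Ã (U+00C3), U+0091, O.
mangled = "DA\u00c3\u0091O"

# Encoding it as UTF-8 turns each of the two wrong characters into
# its own two-byte sequence: Ã -> C3 83, U+0091 -> C2 91.
on_disk = mangled.encode("utf-8")

# Format the bytes like the table does, for easy comparison.
hex_view = " ".join(f"{b:02X}" for b in on_disk)

print(hex_view)   # 44 41 C3 83 C2 91 4F
print(on_disk)    # b'DA\xc3\x83\xc2\x91O'
```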

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 83 C2 91 4F   UTF-8        DAÃ▯O                  unprintable character is retained
44 41 C3 83 C2 91 4F   Latin-1      DAÃ▯Â▯O                 unprintable characters are retained (83 and 91)

No matter how you decode that, it remains broken.

Your data got mangled by two mistakes in a row: decoding it improperly, and then re-encoding it improperly. The right thing would have been to write the bytes from the server directly to disk, without converting them to a string at any point.
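
Writing the bytes straight to disk could look roughly like this. It is a sketch under stated assumptions: `server_bytes` is a hand-made stand-in for whatever bytes object the HTTP client returns, and the temporary path is hypothetical.

```python
import os
import tempfile

# Stand-in for the response body (hypothetical payload); real code
# would take the raw bytes from the HTTP client instead.
server_bytes = "<p>DA\u00d1O</p>".encode("utf-8")

# Hypothetical location for the saved page.
path = os.path.join(tempfile.mkdtemp(), "page.html")

# "wb" writes the bytes untouched: no decode, no re-encode.
with open(path, "wb") as f:
    f.write(server_bytes)

# Reading back in binary mode returns exactly what was received.
with open(path, "rb") as f:
    saved = f.read()

print(saved == server_bytes)                         # True
print(saved.decode("utf-8") == "<p>DA\u00d1O</p>")   # True
```

Because the file on disk is byte-for-byte what the server sent, the only decode that ever happens is the one done later by the parser or by an explicit `encoding="utf8"` when opening the file.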

Answer 2

I've found the unidecode package to work quite well for converting non-ASCII characters to the closest ASCII.

from unidecode import unidecode
def check_ascii(in_string):
    if in_string.isascii():  # Available in python 3.7+
        return in_string
    else:
        return unidecode(in_string)  # Converts non-ascii characters to the closest ascii

Then, if you believe some text might contain non-ASCII characters, you can pass it to the above function. In your case, after extracting the text between the html tags, you can pass it with:

for para in root.iter("p"):
    print(check_ascii(para.text))

You can find details about the package here: https://pypi.org/project/Unidecode/

2 answers

Answer 1
1 Accepted 2019-12-11 09:24:46

Answer 2
0 2019-12-11 09:12:40
