
Python lxml & string encoding issue

I'm using lxml to extract text from html docs and I cannot get some characters from the text to render properly. It's probably a stupid thing, but I can't seem to figure out a solution...

Here's a simplified version of the html:

<html>
    <head>
        <meta content="text/html" charset="UTF-8"/>
    </head>
    <body>
        <p>DAÑA – bis'e</p> <!---that's an N dash and the single quote is curly--->
    </body>
</html>

A simplified version of the code:

import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)

The output in my terminal has those little boxes with a character error (for example, one appears where "– E" should be), but if I copy-paste from there to here, it looks like:

>>> DAÃO bisâe

If I do a simple echo + problem characters in the terminal they render properly, so I don't think that's the problem.

The html encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried the lxml.etree.tostring() with utf-8 declaration (but then .iter() doesn't work ('bytes' object has no attribute 'iter'), or if I put it towards the endnodes in the code, the .text doesn't work ('bytes' object has no attribute 'text')).

Any ideas what's going wrong and/or how to solve?

Open the file with the correct encoding (I'm assuming UTF-8 here, look at the HTML file to confirm).

import lxml.html as LH

with open("path/to/file", encoding="utf8") as f:
    tree = LH.parse(f)
    root = tree.getroot()
    for para in root.iter("p"):
        print(para.text)

Background explanation of how you arrived where you currently are.

Incoming data from the server:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 91 4F         UTF-8        DAÑO                   proper decode
44 41 C3 91 4F         Latin-1      DAÃ▯O                  improper decode

The bytes should not have been decoded as Latin-1, that's an error.

C3 91 represents one character in UTF-8 (the Ñ) but it's two characters in Latin-1 (the Ã, and byte 91). But byte 91 has no printable character in Latin-1 (it falls in the C1 control range), so there is nothing to display. I used ▯ to make it visible. A text editor might skip it altogether, showing DAÃO instead, or a weird box, or an error marker.
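The two rows of the table above can be reproduced in a few lines. This is only a sketch using the example bytes from the table; the variable names are made up for illustration.

```python
# Bytes as they arrive from the server, taken from the table above.
raw = bytes.fromhex("44 41 C3 91 4F")

good = raw.decode("utf-8")    # C3 91 is read as a single character
bad = raw.decode("latin-1")   # C3 and 91 are read as two separate characters

print(good, len(good))        # 4 characters
print(repr(bad), len(bad))    # 5 characters; repr() exposes the invisible U+0091
```

Printing with repr() is a handy way to see control characters that a terminal would otherwise swallow or draw as a box.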

When writing the improperly decoded string to file:

String                 Encoded as   Result Bytes (hex)     Comment
DAÃ▯O                  UTF-8        44 41 C3 83 C2 91 4F   weird box preserved as C2 91

The string should not have been encoded as UTF-8 at this point, that's an error, too.

The Ã got converted to C3 83, which is correct for this character in UTF-8. Note how the byte sequence now matches what you told me in the comments (\xc3\x83\xc2\x91).

When reading that file:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 83 C2 91 4F   UTF-8        DAÃ▯O                  unprintable character is retained
44 41 C3 83 C2 91 4F   Latin-1      DAÃ▯Â▯O                unprintable characters are retained

No matter how you decode that, it remains broken.
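A small sketch to confirm this with the example bytes: neither decoding of the file yields the original text. The last lines go one step further than the answer and are an assumption on my part: in this particular case no bytes were lost along the way, so undoing both mistakes in reverse order happens to recover the text. That is a rescue attempt for already damaged files, not a substitute for storing the data correctly.

```python
expected = "DA\u00d1O"  # the intended text, DAÑO

# Bytes as they ended up on disk after the two mistakes.
on_disk = bytes.fromhex("44 41 C3 83 C2 91 4F")

as_utf8 = on_disk.decode("utf-8")      # 5 characters, still contains U+0091
as_latin1 = on_disk.decode("latin-1")  # 7 characters, two control characters
print(repr(as_utf8), repr(as_latin1))
print(expected in (as_utf8, as_latin1))

# Hedged rescue attempt: reverse the wrong encode, then the wrong decode.
repaired = on_disk.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == expected)
```

The reverse trip raises an exception as soon as the broken string contains anything Latin-1 cannot encode or the result is not valid UTF-8, so it should be treated as best effort.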

Your data got mangled by two mistakes in a row: decoding it improperly, and then re-encoding it improperly. The right thing would have been to write the bytes from the server directly to disk, without converting them to a string at any point.
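A minimal sketch of that idea, assuming the response body is already available as a bytes object. The helper name save_raw, the file name page.html and the sample payload are placeholders, not part of the original code.

```python
import os
import tempfile


def save_raw(payload: bytes, path: str) -> None:
    """Write the bytes unchanged; binary mode means no decoding or encoding."""
    with open(path, "wb") as f:
        f.write(payload)


# Stand-in for the body of a server response (hypothetical data).
payload = bytes.fromhex("44 41 C3 91 4F")
path = os.path.join(tempfile.mkdtemp(), "page.html")
save_raw(payload, path)

# The file holds exactly the bytes that were received ...
with open(path, "rb") as f:
    round_tripped = f.read()

# ... so decoding happens once, at read time, with the correct encoding.
with open(path, encoding="utf-8") as f:
    text = f.read()

print(round_tripped == payload, text)
```

Keeping the data as bytes until the moment it is read means there is only one place where an encoding has to be chosen.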

I've found the unidecode package to work quite well converting non-ascii characters to the closest ascii.

from unidecode import unidecode
def check_ascii(in_string):
    if in_string.isascii():  # Available in python 3.7+
        return in_string
    else:
        return unidecode(in_string)  # Converts non-ascii characters to the closest ascii

Then if you believe some text might contain non-ascii characters you can pass it to the above function. In your case after extracting the text between the html tags you can pass it with:

for para in root.iter("p"):
    print(check_ascii(para.text))

You can find details about the package here: https://pypi.org/project/Unidecode/
