
Using lxml.html with broken HTML entities?

I need to work with a page that has an unfortunate mix of correct and incorrect HTML entities, for instance:

<i>Kristj&aacuten V&iacute;ctor</i>

In Firefox 67, this does eventually get rendered correctly:

[Image: ff-htmlent1.png]

... however, if we do "View Source", Firefox indicates via syntax coloring that something is wrong with the first HTML entity:

[Image: ff-htmlent2.png]

... and indeed there is: the semicolon at the end of the HTML entity is missing. Somehow Firefox figures it out anyway and renders the right character.

Now, if I try to work with that in lxml:

#!/usr/bin/env python3

import lxml.html as LH
import lxml.html.clean as LHclean

testhtmlstring = "<i>Kristj&aacuten V&iacute;ctor</i>"

myhtml = LH.fromstring( testhtmlstring )
myhtml = LHclean.clean_html( myhtml )
myitem = myhtml.xpath("//i")[0]
myitemstr = myitem.text_content()
print(myitemstr)

... the code prints this in the terminal (Ubuntu 18.04):

Kristj&aacuten Víctor

... so, obviously, the broken HTML entity did not get converted to the right character.

Is there something I can use so that I get the right character in my output string from lxml, even in the case of a broken HTML entity (as Firefox does)?

The HTML 5 standard specifies a subset of entities that can be parsed without the trailing semicolon, because these entities were historically defined with the semicolon being optional.
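As a quick way to see which entities fall into that subset, Python's standard library exposes the HTML 5 entity table as html.entities.html5. In that table, names that may be written without the semicolon appear both with and without it. The sketch below is only an illustration; the variable name legacy_names is made up for this example.

```python
from html.entities import html5

# Entity names stored WITHOUT a trailing ";" are the ones that may be
# written with the semicolon omitted. "legacy_names" is a name chosen
# for this sketch only.
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

print("aacute" in legacy_names)  # True: "&aacute" is accepted without ";"
print("hellip" in legacy_names)  # False: "&hellip;" needs its semicolon
print(html5["aacute"])           # á
```

So the broken &aacuten from the question happens to be recoverable, whereas an entity outside this subset would not be.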

The html.unescape() function explicitly supports those; use that function as a second pass to clear up this issue:

>>> from html import unescape
>>> unescape("Kristj&aacuten Víctor")
'Kristján Víctor'
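A minimal sketch of that second pass, applied to the string that text_content() returned in the question. The helper name repair_entities is hypothetical and is not part of lxml or the standard library.

```python
from html import unescape


def repair_entities(text):
    """Decode entity references that the parser left in the text.

    Hypothetical helper for this sketch: it is just a thin wrapper
    around html.unescape().
    """
    return unescape(text)


# The string that the lxml code in the question printed.
leftover = "Kristj&aacuten Víctor"

print(repair_entities(leftover))           # Kristján Víctor
print(repair_entities("Kristján Víctor"))  # already clean text is unchanged
print(repair_entities("&hellip"))          # &hellip (semicolon is required here)
```

One thing to keep in mind: because this runs after the parser has already decoded the well-formed entities, text that was meant to read literally as "&aacute" (written as &amp;aacute in the page source) would also be turned into "á".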

If you install html5lib, then you can have lxml behave the same way via its lxml.html.html5parser module (which wraps html5lib's own html5lib.treebuilders.etree_lxml adapter):

>>> from lxml.html import html5parser as etree
>>> etree.fromstring("Kristj&aacuten Víctor").text
'Kristján Víctor'
