Using lxml.html with broken HTML entities?

Question

I need to work with a page that has an unfortunate mix of correct and incorrect HTML entities, for instance:

<i>Kristj&aacuten V&iacute;ctor</i>

In Firefox 67, this does end up being rendered correctly.

However, if we do "View Source", Firefox indicates via syntax coloring that something is wrong with the first HTML entity.

And indeed there is: the semicolon at the end of the HTML entity is missing. Somehow Firefox figures it out anyway and renders the right character.

Now, if I try to work with that in lxml:

#!/usr/bin/env python3

import lxml.html as LH
import lxml.html.clean as LHclean

testhtmlstring = "<i>Kristj&aacuten V&iacute;ctor</i>"

myhtml = LH.fromstring( testhtmlstring )
myhtml = LHclean.clean_html( myhtml )
myitem = myhtml.xpath("//i")[0]
myitemstr = myitem.text_content()
print(myitemstr)

... the code prints this in the terminal (Ubuntu 18.04):

Kristj&aacuten Víctor

... so, obviously, the broken HTML entity did not get converted to the right character.

Is there something I can use so that I get the right character in my output string from lxml, even in the case of a broken HTML entity (as Firefox does)?

Answer 1

The HTML 5 standard specifies a subset of entities that can be parsed without the trailing semicolon, because these entities were historically defined with the semicolon being optional.
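To see which entity names fall into that subset, one option is to inspect the `html.entities.html5` mapping in the Python standard library, which lists the semicolon-optional names both with and without the trailing semicolon. This is only a small sketch; the variable name `legacy_names` is made up for illustration.

```python
from html.entities import html5

# Names stored without a trailing ";" are the ones that may be
# written without the semicolon (the historical subset).
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

print("aacute" in legacy_names)   # True: "&aacute" is allowed without ";"
print("hellip" in legacy_names)   # False: "&hellip" needs the ";"
print("hellip;" in html5)         # True: the full form is known
```

So `&aacute` from the question is in the subset, while a newer name such as `&hellip` is not and would stay untouched if its semicolon were missing.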

The html.unescape() function explicitly supports those; use that function as a second pass to clear up this issue:

>>> from html import unescape
>>> unescape("Kristj&aacuten Víctor")
'Kristján Víctor'
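As a rough sketch of that second pass, the snippet below starts from the string that lxml's `text_content()` printed in the question and runs it through `html.unescape()`. The variable names are illustrative, and the input is hard-coded here instead of coming from lxml so the example stays standard-library only.

```python
from html import unescape

# What text_content() returned in the question: lxml decoded "&iacute;"
# but left the semicolon-less "&aacute" as literal text.
lxml_output = "Kristj&aacuten Víctor"

# Second pass: decode whatever entity references are still in the text.
fixed = unescape(lxml_output)
print(fixed)  # Kristján Víctor

# Names outside the semicolon-optional subset are left alone.
untouched = unescape("wait&hellip done")
print(untouched)  # wait&hellip done

# Caveat: text that is *meant* to contain an entity reference literally
# gets decoded as well, because this is a second round of unescaping.
double_decoded = unescape("&lt;b&gt;")
print(double_decoded)  # <b>
```

The last case is worth keeping in mind: if the page legitimately showed the literal text `&lt;b&gt;` (written as `&amp;lt;b&amp;gt;` in the source), the second pass would turn it into `<b>`, so this approach fits best when the extracted text is not expected to contain literal entity references.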

If you install html5lib, then you can have lxml behave the same way, via its lxml.html.html5parser module (which wraps html5lib's own html5lib.treebuilders.etree_lxml adapter):

>>> from lxml.html import html5parser as etree
>>> etree.fromstring("Kristj&aacuten Víctor").text
'Kristján Víctor'

Answer 1: 2, accepted, 2019-06-16 21:03:57
