Printing HTML entities with lxml in Python

Question

I'm trying to build a div element from the string below, which contains HTML entities. Because of that, the reserved & character in each entity is escaped as &amp; in the output, so the entities are displayed as plain text. How can I avoid this so that the HTML entities are rendered properly?

from lxml import etree
import lxml.html

s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'

div = etree.Element("div")
div.text = s

lxml.html.tostring(div)

output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>

Answer 1

You can specify encoding when calling tostring():

>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
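If the result has to stay a div, as in the question, rather than the <p> that fromstring() wraps the text in, one possible variation is to decode the entities yourself before assigning the text. The sketch below is not part of the original answer; it assumes Python 3, where the standard library provides html.unescape(), and the variable name result is only illustrative.

```python
import html

from lxml import etree
import lxml.html

s = ('Actress Adamari L&#243;pez And Amgen Launch Spanish-Language '
     'Chemotherapy: Myths Or Facts&#8482; Website And Resources')

div = etree.Element("div")
# Turn the character references into real characters first, so lxml
# has no literal "&" left to escape as "&amp;".
div.text = html.unescape(s)

# encoding='unicode' makes tostring() return a str instead of bytes.
result = lxml.html.tostring(div, encoding='unicode')
print(result)
```

Here the text node already holds "ó" and "™" as ordinary characters, so the serializer writes them out unchanged inside the div.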

As a side note, you should definitely use lxml.html.tostring() when dealing with HTML data:

Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
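The difference described in that note can be checked with a small comparison. This is only a sketch: it assumes the "lxml.tostring" in the quoted text corresponds to lxml.etree.tostring, and the file name app.js is a made-up placeholder.

```python
from lxml import etree
import lxml.html

# A tiny fragment with an empty <script> element (app.js is hypothetical).
doc = lxml.html.fromstring('<div><script src="app.js"></script></div>')

# HTML serialization keeps the explicit closing tag.
as_html = lxml.html.tostring(doc, encoding='unicode')

# XML serialization collapses the empty element into a self-closing tag.
as_xml = etree.tostring(doc, encoding='unicode')

print(as_html)
print(as_xml)
```

Printing both strings side by side should show the closing </script> tag only in the HTML version.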

Also see:

Serialising to Unicode strings

Answer 1
3 2014-12-07 06:20:04
