Printing HTML entities using lxml in Python

I'm trying to build a div element from the string below, which contains HTML entities. Because of that, the reserved & character in each entity is escaped as &amp;amp; in the output, so the entities are displayed as plain text. How can I avoid this so that the HTML entities are rendered properly?

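import lxml.html
from lxml import etree

s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'

div = etree.Element("div")
div.text = s  # .text is treated as literal text, so every & gets escaped

lxml.html.tostring(div)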

output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>

Parse the string with fromstring() so that the entities are decoded, then specify the encoding when calling tostring():

>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
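
If the goal is still to build the div by hand, as in the question, one possible approach is to decode the entities before assigning the text. The sketch below is not part of the original answer; it assumes Python 3.4+ (for html.unescape from the standard library) and that the input only contains character references that should become literal characters.

```python
import html

import lxml.html
from lxml import etree

s = ('Actress Adamari L&#243;pez And Amgen Launch Spanish-Language '
     'Chemotherapy: Myths Or Facts&#8482; Website And Resources')

div = etree.Element("div")
# Assumption: decode the references first, so .text holds real characters
# and there is no literal "&" left for the serializer to escape.
div.text = html.unescape(s)

# encoding="unicode" returns a str containing the characters themselves.
result = lxml.html.tostring(div, encoding="unicode")
print(result)
```

This keeps the div element from the question instead of the p element that fromstring() produces for a bare text fragment.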

As a side note, you should definitely use lxml.html.tostring() when dealing with HTML data. Quoting the lxml documentation:

Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
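
The difference between the two serializers can be checked with a small sketch. The file name app.js is a hypothetical placeholder, and the tree is built by hand here only to keep the example short.

```python
import lxml.html
from lxml import etree

# Build <div><script src="app.js"></script></div>; "app.js" is a made-up name.
div = etree.Element("div")
etree.SubElement(div, "script", src="app.js")

as_html = lxml.html.tostring(div, encoding="unicode")  # HTML serialization
as_xml = etree.tostring(div, encoding="unicode")       # XML serialization

print(as_html)  # the empty script element keeps an explicit closing tag
print(as_xml)   # the empty script element is collapsed into a self-closing tag
```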
