Printing HTML entities using lxml in Python
I'm trying to make a div element from the string below, which contains HTML entities. Because & is a reserved character in HTML, the & at the start of each entity is escaped as &amp; in the output, so the entities are displayed as plain text instead of being rendered. How can I avoid this so that the HTML entities are rendered properly?
from lxml import etree
import lxml.html

s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
div = etree.Element("div")
div.text = s
lxml.html.tostring(div)
output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>
You can parse the string with fromstring(), which decodes the entities, and specify encoding while calling tostring() so that non-ASCII characters are written as-is:
>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
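To make the difference concrete, here is a minimal sketch, assuming lxml is installed and using a shortened sample string that is not from the question. It contrasts assigning the raw string to .text, which treats & as literal text and escapes it, with parsing the markup through fromstring(), which decodes the entities first.

```python
import lxml.html
from lxml import etree

# Shortened, hypothetical sample containing two numeric HTML entities.
s = "L&#243;pez And Facts&#8482;"

# Assigning to .text stores the characters literally, so the serializer
# has to escape each "&" to keep the output valid.
literal_div = etree.Element("div")
literal_div.text = s
escaped = lxml.html.tostring(literal_div)

# Parsing the markup decodes the entities into real characters;
# encoding="unicode" then returns a str with those characters as-is.
parsed_div = lxml.html.fromstring("<div>" + s + "</div>")
rendered = lxml.html.tostring(parsed_div, encoding="unicode")

print(escaped)
print(rendered)
```

Wrapping the string in div tags before parsing is one way to get a div rather than the p element that fromstring() produces for bare text. It assumes the string is trusted markup, since anything inside it is parsed as HTML.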
As a side note, you should definitely use lxml.html.tostring() when dealing with HTML data:
Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
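The difference described in the quoted note can be seen in a short sketch. It assumes the XML serializer that the note calls lxml.tostring is lxml.etree.tostring, and app.js is a placeholder file name, not something from the original post.

```python
import lxml.html
from lxml import etree

# "app.js" is a placeholder file name used only for illustration.
doc = lxml.html.fromstring('<div><script src="app.js"></script></div>')

# XML serialization collapses the empty element into a self-closing tag.
as_xml = etree.tostring(doc, encoding="unicode")

# HTML serialization keeps the explicit closing tag that browsers expect.
as_html = lxml.html.tostring(doc, encoding="unicode")

print(as_xml)
print(as_html)
```

Comparing the two printed strings shows why the HTML serializer is the safer default for markup that will be sent to a browser.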