Can CDATA sections be preserved by BeautifulSoup?
I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.
The culprit XML file:
<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
!@#$%^&*()_+{}|:"<>?,./;'[]\-=
]]></bar>
</foo>
And here's the Python script.
from bs4 import BeautifulSoup
xmlfile = open("cdata.xml", "r")
soup = BeautifulSoup( xmlfile, "xml" )
print(soup)
Here's the output. Note that the CDATA section markers are missing.
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
!@#$%^&*()_+{}|:"<>?,./;'[]\-=
</bar>
</foo>
I also tried printing
soup.prettify(formatter="xml")
and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?
Is there a way to tell BeautifulSoup to preserve CDATA sections?
Update: Yes, it's an lxml thing. http://lxml.de/api.html#cdata
So the question becomes: is it possible to tell BeautifulSoup to initialize lxml with
strip_cdata=False
?
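For reference, here is a minimal sketch of what strip_cdata=False does when lxml is used directly, without BeautifulSoup. It assumes lxml is installed; the sample string and the " (edited)" suffix are made up for illustration. It does not show how to pass the option through BeautifulSoup, only what the lxml side looks like.

```python
from lxml import etree

# Made-up sample document with one CDATA section.
xml_bytes = b"<foo><bar><![CDATA[a < b && c > d]]></bar></foo>"

# strip_cdata=False asks lxml to keep CDATA sections as CDATA nodes
# instead of converting them to ordinary text while parsing.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml_bytes, parser)

# Serializing the untouched tree writes the CDATA section back out.
roundtrip = etree.tostring(root, encoding="unicode")
print(roundtrip)

# .text returns the plain character data, without the CDATA markers.
bar = root.find("bar")
text = bar.text

# Assigning a plain string would be serialized as escaped text, so the
# new value is wrapped in etree.CDATA to keep it a CDATA section.
bar.text = etree.CDATA(text + " (edited)")
edited = etree.tostring(root, encoding="unicode")
print(edited)
```

Wrapping the new value in etree.CDATA matters only when the element text is modified; text that is left alone keeps its CDATA form on output as long as the parser was created with strip_cdata=False.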
In my case, if I use
soup = BeautifulSoup( xmlfile, "lxml-xml" )
then the CDATA content is preserved and accessible.
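If the text is accessible but the CDATA markers still go missing when the document is written back out, one possible workaround is to re-wrap the text in BeautifulSoup's CData string class before serializing. This is a sketch rather than a confirmed fix for every version: it assumes bs4 and lxml are installed, and the sample string is made up for illustration.

```python
from bs4 import BeautifulSoup, CData

# Made-up sample document with one CDATA section.
xml_text = "<foo><bar><![CDATA[a < b && c > d]]></bar></foo>"

soup = BeautifulSoup(xml_text, "lxml-xml")
bar = soup.find("bar")

# Read the character data of <bar> as a plain Python string.
text = bar.get_text()

# Replace the tag's contents with a CData node, which BeautifulSoup
# writes out wrapped in <![CDATA[ ... ]]> without escaping the text.
bar.clear()
bar.append(CData(text))

output = str(soup)
print(output)
```

This has to be done per element, so the script needs to know which elements originally held CDATA, for example from a fixed list of tag names.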