Can BeautifulSoup preserve CDATA sections?

Question

I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.

The culprit XML file:

<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        !@#$%^&*()_+{}|:"<>?,./;'[]\-=
    ]]></bar>
</foo>

And here's the Python script.

from bs4 import BeautifulSoup

# Open the file in a context manager so it is closed after parsing.
with open("cdata.xml", "r") as xmlfile:
    soup = BeautifulSoup(xmlfile, "xml")
print(soup)

Here's the output. Note that the CDATA section tags are missing.

<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
        !@#$%^&amp;*()_+{}|:"&lt;&gt;?,./;'[]\-=
    </bar>
</foo>

I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?

Is there a way to tell BeautifulSoup to preserve CDATA sections?

Update: Yes, it's an lxml thing (http://lxml.de/api.html#cdata). So the question becomes: is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?
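For reference, the strip_cdata option can be exercised with lxml directly, without going through BeautifulSoup. This is only a minimal sketch, assuming lxml is installed; the inline XML string is a shortened stand-in for cdata.xml, not the original file.

```python
from lxml import etree

# Shortened stand-in for cdata.xml (assumption); <, > and & are legal inside CDATA.
XML = b"<foo><bar><![CDATA[ 1 < 2 & 3 > 2 ]]></bar></foo>"

# strip_cdata=False asks lxml to keep CDATA sections instead of
# turning them into ordinary text that gets escaped on output.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(XML, parser)

# The content is still readable as a plain string ...
bar_text = root.find("bar").text

# ... and serialization writes the CDATA wrapper back out.
serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```

If the text of such an element is reassigned, wrapping the new value in etree.CDATA(...) should keep it a CDATA section on output.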

Answer 1

In my case, if I use

soup = BeautifulSoup( xmlfile, "lxml-xml" )

then the CDATA is preserved and accessible.
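If the CDATA wrapper is still missing from the output in your setup (the Beautiful Soup docs list both "xml" and "lxml-xml" as names for lxml's XML parser), one possible workaround is to re-wrap the text in bs4's CData class before serializing. This is a sketch, not a general fix: it assumes you already know which tags should hold CDATA (here, bar), and the inline XML is a shortened stand-in for cdata.xml.

```python
from bs4 import BeautifulSoup, CData

# Shortened stand-in for cdata.xml (assumption).
XML = "<foo><bar><![CDATA[ 1 < 2 & 3 > 2 ]]></bar></foo>"

soup = BeautifulSoup(XML, "xml")

# Assumption: <bar> is the only tag whose text should be a CDATA section.
for tag in soup.find_all("bar"):
    if tag.string is not None:
        # CData is a NavigableString subclass that is written out
        # as <![CDATA[...]]> without entity escaping.
        tag.string.replace_with(CData(tag.string))

output = str(soup)
print(output)
```

The same replace_with call can be applied to any other tag whose text needs to stay unescaped when the document is written back out.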

4 2015-12-26 21:00:23
