Beautiful Soup XML formatting in Python

Question

I have an XML dataset with tags in the following format:

<catchphrase "id=c0">unconscionable conduct</catchphrase>

I think that when the dataset was created, the id attribute was not formatted the way it should be:

<catchphrase id="c0">unconscionable conduct</catchphrase>

However, when this goes through the Beautiful Soup library in Python, it comes out as follows:

 soup = BeautifulSoup(content, 'xml')

results in

 <catchphrase>
   "id=c0"&gt;application for leave to appeal
  </catchphrase>

or

soup = BeautifulSoup(content, 'lxml')

results in

<html>
   <body>
    ...
     <catchphrase>
         application for leave to appeal
     </catchphrase>
    ....

I want the output to look like the second one, but without the html and body tags (this is an XML document). I don't need the id attribute. I also call soup.prettify('utf-8') before writing to the file, but I think the document is already wrongly formatted by that point.

Answer 1

There is no standard way of doing this, but you can replace the malformed part with the correct form before parsing, something like this:

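from bs4 import BeautifulSoup

content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

# Move the misplaced opening quote so that "id=c0" becomes id="c0",
# then parse the repaired string with the XML parser.
soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)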

This results in:

<catchphrase id="c0">unconscionable conduct</catchphrase>

This is admittedly a bit of a hack. There is no standard way to handle this, mainly because the XML is expected to be well-formed before BeautifulSoup parses it.
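A plain string replace also rewrites any `"id=` that happens to appear in the element text. A slightly more targeted variant of the same idea is sketched below, assuming every malformed attribute has the shape `"id=VALUE"` directly after the tag name, as in the question. The helper names `repair_id_attributes` and `strip_id_attributes` are hypothetical, and `xml.etree.ElementTree` is used here only to check that the result is well-formed; the repaired string could equally be passed to `BeautifulSoup(..., 'xml')`.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening tag name followed by a misquoted attribute such as "id=c0".
# Assumption: the broken attribute always looks like "id=VALUE" right after the tag name.
BROKEN_ID = re.compile(r'(<[\w:.-]+)\s+"id=([^">]*)"')


def repair_id_attributes(text):
    """Rewrite <tag "id=c0"> as <tag id="c0">."""
    return BROKEN_ID.sub(r'\1 id="\2"', text)


def strip_id_attributes(text):
    """Drop the misquoted id attribute entirely: <tag "id=c0"> becomes <tag>."""
    return BROKEN_ID.sub(r'\1', text)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

repaired = repair_id_attributes(content)
stripped = strip_id_attributes(content)

# Both results should now parse as well-formed XML.
print(ET.fromstring(repaired).get("id"))  # c0
print(stripped)  # <catchphrase>unconscionable conduct</catchphrase>
```

Since the question says the id attribute is not needed, the stripping variant may be the simpler choice: the cleaned text can be parsed with the 'xml' parser, which avoids the html and body wrappers that the 'lxml' parser adds.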

Answer 1
2 accepted 2017-03-30 18:28:13