Beautiful Soup XML formatting in Python
I have an XML dataset with tags in the following format:
<catchphrase "id=c0">unconscionable conduct</catchphrase>
I think whoever made the dataset did not format the id attribute the way it should be:
<catchphrase id="c0">unconscionable conduct</catchphrase>
However, when this goes through the Beautiful Soup library in Python, it comes out as follows:
soup = BeautifulSoup(content, 'xml')
results in
<catchphrase>
"id=c0">application for leave to appeal
</catchphrase>
or
soup = BeautifulSoup(content, 'lxml')
results in
<html>
<body>
...
<catchphrase>
application for leave to appeal
</catchphrase>
....
I want the output to look like the second one, but without the html and body tags (this is an XML document). I don't need the id attribute. I also call
soup.prettify('utf-8')
before writing the result to the file, but I think the markup is already wrongly formatted by that point.
There is no standard way of doing this, but you can replace the faulty part with the correct form before parsing, something like this:
from bs4 import BeautifulSoup
content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'
soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)
This results in:
<catchphrase id="c0">unconscionable conduct</catchphrase>
This is definitely a bit of a hack. There is no standard way to handle this, mainly because the XML is supposed to be well-formed before BeautifulSoup parses it.
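If the dataset has more than one misquoted attribute, the same idea can be generalized with a regular expression. The following is only a minimal sketch using the standard library: the helper name repair_misquoted_attrs, the drop flag, and the pattern itself are assumptions made for illustration, not part of Beautiful Soup. It either moves the opening quote to the right place or, since the id attribute is not needed, removes the attribute altogether.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: matches a misquoted attribute such as  "id=c0"
# (opening quote before the name instead of before the value).
# The lookbehind skips correctly quoted values like x="a=b".
MISQUOTED_ATTR = r'(?<!=)"([A-Za-z_][\w.-]*)=([^"<>]*)"'


def repair_misquoted_attrs(text, drop=False):
    """Rewrite "name=value" as name="value", or remove it when drop is True."""
    if drop:
        # Also swallow the whitespace in front of the attribute.
        return re.sub(r'\s*' + MISQUOTED_ATTR, '', text)
    return re.sub(MISQUOTED_ATTR, r'\1="\2"', text)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

fixed = repair_misquoted_attrs(content)
stripped = repair_misquoted_attrs(content, drop=True)

# Sanity check: both strings now parse as well-formed XML
# (ET.fromstring raises ParseError otherwise).
ET.fromstring(fixed)
ET.fromstring(stripped)

print(fixed)
print(stripped)
```

Either string can then be passed to BeautifulSoup(..., 'xml') in place of content. The pattern is a heuristic: it would also rewrite a quoted "name=value" that happens to appear in element text, so it is worth checking it against a sample of the dataset first.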