
Beautiful Soup XML formatting in Python

I have an XML dataset containing tags in the following format:

<catchphrase "id=c0">unconscionable conduct</catchphrase>

I think whoever built the dataset misplaced the quotes around the id attribute. It should have been:

<catchphrase id="c0">unconscionable conduct</catchphrase>

However, when this goes through the Beautiful Soup library in Python, it comes out as follows:

 soup = BeautifulSoup(content, 'xml')

results in

 <catchphrase>
   "id=c0"&gt;application for leave to appeal
  </catchphrase>

or

soup = BeautifulSoup(content, 'lxml')

results in

<html>
   <body>
    ...
     <catchphrase>
         application for leave to appeal
     </catchphrase>
    ....

I want the output to look like the second one, but without the html and body tags (this is an XML document). I don't need the id attribute. I also call soup.prettify('utf-8') before writing the result to a file, but I think the document is already wrongly formatted by that point.

There is no standard way of doing this, but you can replace the faulty part of the markup with the correct form before parsing, something like this:

from bs4 import BeautifulSoup
content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

# Move the opening quote so that "id=c0" becomes id="c0" before parsing
soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)

This results in:

<catchphrase id="c0">unconscionable conduct</catchphrase>

This is definitely a bit of a hack. There is no standard way to handle this, mainly because the XML is supposed to be well-formed before BeautifulSoup parses it.
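If the dataset has misquoted attributes other than id, or if the attribute is not needed at all (as the question says), the same pre-processing idea could be generalised with a regular expression. The following is only a sketch under assumptions: the helper names repair_attributes and drop_attributes are made up for illustration, the pattern assumes every broken attribute looks like "name=value" inside an opening tag, and the lookahead is a heuristic rather than a real XML tokenizer. It uses only the standard library so that the repaired text can be checked for well-formedness.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic: a quoted "name=value" token that is followed by '>' before any
# other '<' or '>' is assumed to sit inside an opening tag.
MISQUOTED_ATTR = re.compile(r'"(\w+)=([^"<>]*)"(?=[^<>]*>)')


def repair_attributes(text):
    """Rewrite "name=value" as name="value" (hypothetical helper)."""
    return MISQUOTED_ATTR.sub(r'\1="\2"', text)


def drop_attributes(text):
    """Remove the misquoted attribute and the whitespace before it (hypothetical helper)."""
    return re.sub(r'\s*"\w+=[^"<>]*"(?=[^<>]*>)', '', text)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

repaired = repair_attributes(content)
stripped = drop_attributes(content)

print(repaired)
print(stripped)

# Sanity check: the repaired markup is now well-formed XML.
element = ET.fromstring(repaired)
print(element.get('id'), element.text)
```

Either string can then be passed to BeautifulSoup(..., 'xml') as in the snippet above. Dropping the attribute gives the tag without an id, which is what the question asks for, while still using the XML parser so no html and body wrappers are added.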
