
Beautiful Soup XML formatting in Python

I have an XML dataset in which a tag is written in the following format:

<catchphrase "id=c0">unconscionable conduct</catchphrase>

I think whoever made the dataset misplaced the opening quote of the id attribute. It should have been:

<catchphrase id="c0">unconscionable conduct</catchphrase>

However, when I parse it with the Beautiful Soup library in Python, it comes out as follows:

 soup = BeautifulSoup(content, 'xml')

results in

 <catchphrase>
   "id=c0"&gt;application for leave to appeal
  </catchphrase>

or

soup = BeautifulSoup(content, 'lxml')

results in

<html>
   <body>
    ...
     <catchphrase>
         application for leave to appeal
     </catchphrase>
    ....

I want the output to look like the second one, but without the html and body tags (this is an XML document). I don't need the id attribute. I also call soup.prettify('utf-8') before writing the result to a file, but I think the markup is already wrong by that point.

There is no standard way of doing this, but you can replace the faulty part with the correct form before parsing, something like this:

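from bs4 import BeautifulSoup

content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

# Move the misplaced quote: '"id=c0"' becomes 'id="c0"' (the closing quote is already in place).
# The 'xml' parser requires lxml to be installed.
soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)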

This results in:

<catchphrase id="c0">unconscionable conduct</catchphrase>

This is definitely a bit of a hack. There is no standard way to handle this, mainly because XML is supposed to be well-formed before BeautifulSoup parses it.
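If the dataset has misquoted attributes other than id, or if you would rather drop the attribute entirely (as the question says it is not needed), a regular expression may be a little more targeted than a plain replace. The following is only a sketch: the function name repair_misquoted_attrs and its drop parameter are made up for illustration, and it assumes the broken attribute comes directly after the tag name and that its value contains no quotes or angle brackets. It uses only the standard library to check the result.

```python
import re
import xml.etree.ElementTree as ET

# Matches a tag name followed by a misquoted attribute, e.g. '<catchphrase "id=c0"'.
# Assumptions: the broken attribute is the first thing after the tag name,
# and its value contains no quotes or angle brackets.
MISQUOTED_ATTR = re.compile(r'(<[\w:-]+)\s+"([\w:-]+)=([^"<>]*)"')


def repair_misquoted_attrs(xml_text, drop=False):
    """Rewrite '<tag "name=value"' as '<tag name="value"', or as '<tag' when drop=True."""
    replacement = r'\1' if drop else r'\1 \2="\3"'
    return MISQUOTED_ATTR.sub(replacement, xml_text)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

repaired = repair_misquoted_attrs(content)
stripped = repair_misquoted_attrs(content, drop=True)

print(repaired)  # <catchphrase id="c0">unconscionable conduct</catchphrase>
print(stripped)  # <catchphrase>unconscionable conduct</catchphrase>

# Sanity check: both strings are now well-formed XML.
print(ET.fromstring(repaired).attrib)  # {'id': 'c0'}
print(ET.fromstring(stripped).text)    # unconscionable conduct
```

Either string can then be passed to BeautifulSoup(..., 'xml') in the same way as the replace call above. Tags that are already well-formed are left untouched, because the pattern only matches a quote that appears right after the tag name.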

