I have an XML dataset whose tags are in the following format:
<catchphrase "id=c0">unconscionable conduct</catchphrase>
I think whoever made the dataset did not format the id attribute correctly. It should be:
<catchphrase id="c0">unconscionable conduct</catchphrase>
However, when this goes through the Beautiful Soup library in Python, it comes out as follows:
soup = BeautifulSoup(content, 'xml')
results in
<catchphrase>
"id=c0">application for leave to appeal
</catchphrase>
or
soup = BeautifulSoup(content, 'lxml')
results in
<html>
<body>
...
<catchphrase>
application for leave to appeal
</catchphrase>
....
I want it to look like the second one but without the html and body tags (this is an XML document). I don't need the id attribute. I also use soup.prettify('utf-8') before writing it to the file, but I think it is already wrongly formatted by the time I do that.
There is no standard way of doing this, but you can replace the faulty part with the correct form before parsing, something like this:
from bs4 import BeautifulSoup
content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'
soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)
This results in (the 'xml' parser may also print an XML declaration line first):
<catchphrase id="c0">unconscionable conduct</catchphrase>
This is definitely a bit of a hack: there is no standard way to handle this, mainly because the XML is supposed to be well-formed before BeautifulSoup parses it.
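If the dataset contains ids other than c0, or if the id attribute is not needed at all (as the question says), a regular expression may be a slightly more general form of the same hack. The following is only a sketch: the helper names repair_tags and drop_bad_attribute are made up for illustration, and the pattern assumes each faulty tag carries exactly one misquoted attribute right after the tag name, as in the sample above. It uses only the standard library to check that the repaired text is well-formed.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening tag such as  <catchphrase "id=c0">  where the quote
# was placed before the attribute name instead of before the value.
# Assumption: one misquoted attribute, directly after the tag name.
BAD_TAG = re.compile(r'<(\w+)\s+"(\w+)=([^"<>]*)">')


def repair_tags(text):
    """Rewrite <tag "name=value"> as <tag name="value">."""
    return BAD_TAG.sub(r'<\1 \2="\3">', text)


def drop_bad_attribute(text):
    """Rewrite <tag "name=value"> as <tag>, discarding the attribute."""
    return BAD_TAG.sub(r'<\1>', text)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

repaired = repair_tags(content)
stripped = drop_bad_attribute(content)

# Both results should now parse as well-formed XML.
repaired_element = ET.fromstring(repaired)
stripped_element = ET.fromstring(stripped)

print(repaired)
print(stripped)
```

Either string can then be passed to BeautifulSoup(..., 'xml') in place of the original content. Using the 'xml' parser rather than 'lxml' should also avoid the html and body wrappers, since those are added by the HTML parser.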