Empty element error with Beautiful Soup

I am parsing an XML file using Beautiful Soup, but I have found inconsistent behaviour when parsing empty elements. For example:

from BeautifulSoup import BeautifulSoup
s1 = "<c><a /><b /></c>"
s2 = "<c><a></a><b></b></c>"
soup1 = BeautifulSoup(s1)
soup2 = BeautifulSoup(s2)
print soup1
# <c><a><b></b></a></c>
print soup2
# <c><a></a><b></b></c>

Note that the b tag ends up nested inside the a tag in the first case, but not in the second. I thought that, according to the XML spec, s1 and s2 were equivalent?

Any thoughts as to how I can deal with this?

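The BeautifulSoup class treats its input as HTML rather than generic XML. The anchor and bold (<a>, <b>) elements cannot be self-closed in HTML, so this is invalid XHTML, and the parser reads <a /> as an opening tag that is still open when <b /> arrives.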

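On top of that, the HTML compatibility guidelines in the XHTML spec recommend a space before the slash, and the minimized syntax only for elements that are actually empty: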

Include a space before the trailing / and > of empty elements, e.g. <br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.
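To check the question's assumption that s1 and s2 are equivalent as XML, here is a minimal sketch that parses both strings with the standard library's xml.etree.ElementTree instead of Beautiful Soup. This is a swapped-in parser, not something the original posts used, and the helper name child_tags is made up for illustration. A real XML parser has no notion of HTML element rules, so it should treat both forms the same way.

```python
import xml.etree.ElementTree as ET

s1 = "<c><a /><b /></c>"
s2 = "<c><a></a><b></b></c>"


def child_tags(xml_text):
    """Return the tag names of the root element's direct children.

    Hypothetical helper for this example only.
    """
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]


# With an XML parser, <a /> and <a></a> both yield an empty element,
# so <b> stays a sibling of <a> in both documents.
print(child_tags(s1))
print(child_tags(s2))
```

If both calls list a and b as direct children of c, the nesting seen in the question comes from HTML-oriented parsing, not from the XML itself.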
