
BeautifulSoup doesn't parse XML loaded from local file

My Python script uses BeautifulSoup to parse XML read from a local file, but looking up an element returns None:

from bs4 import BeautifulSoup

with open('conf//test2.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()

# this creates a soup object out of xmlData,
# which is properly loaded from the file above
xmlSoup = BeautifulSoup(xmlData, "html.parser")

# this resolves to None
subElemX = xmlSoup.root.singleelement.find('subElementX', recursive=False)

The file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <singleElement>
        <subElementX>XYZ</subElementX>
    </singleElement>
    <repeatingElement id="1"/>
    <repeatingElement id="2"/>
</root>

I also have a REST GET service that returns the same XML, but when I read it using requests.get, it is parsed fine:

import requests

# serviceURL and headers are defined elsewhere in the script
resp = requests.get(serviceURL, headers=headers)

respXML = resp.content.decode("utf-8")

restSoup = BeautifulSoup(respXML, "html.parser")

Why does it work with the REST response and not with the data read out of a local file?

UPDATE: While I understand that Python is case sensitive and singleelement != singleElement, the case is disregarded when parsing the web service response.

Two things to make it work:

  • change the features argument from html.parser to xml (you are parsing XML data, and XML != HTML); note that the xml parser requires lxml to be installed
  • change singleelement to singleElement

Changes applied (works for me):

xmlSoup = BeautifulSoup(xmlData, "xml")

subElemX = xmlSoup.root.singleElement.find('subElementX', recursive=False)
print(subElemX)  # prints <subElementX>XYZ</subElementX>

HTML tag names are case-insensitive, so html.parser internally converts all tag names to lower case. Given that, the following line should work when the soup is built with html.parser:

subElemX = xmlSoup.root.singleelement.find('subelementx', recursive=False)
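To see the lowercasing for yourself, here is a minimal sketch that parses the XML from the question out of an in-memory string (XML_TEXT is just a stand-in for the file contents, not part of the original script) and lists the tag names that html.parser actually stores:

```python
from bs4 import BeautifulSoup

# Stand-in for the contents of the file from the question.
XML_TEXT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <singleElement>
        <subElementX>XYZ</subElementX>
    </singleElement>
    <repeatingElement id="1"/>
    <repeatingElement id="2"/>
</root>"""

soup = BeautifulSoup(XML_TEXT, "html.parser")

# Collect the distinct tag names as the parser stored them.
tag_names = sorted({tag.name for tag in soup.find_all(True)})
print(tag_names)

# Lookups compare against the stored (lower-case) names.
mixed_case = soup.find("subElementX")
lower_case = soup.find("subelementx")
print(mixed_case)
print(lower_case)
```

The mixed-case lookup comes back empty because the tree only contains lower-case names, which is the behaviour described in the question.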

But in general, you shouldn't parse XML documents with an HTML parser. XML is strict about its syntax, including the case of tag names, and that's for a good reason.
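As a rough illustration of that strictness, the sketch below uses the standard library's xml.etree.ElementTree instead of BeautifulSoup (a swapped-in parser, chosen only because it needs no extra install). The two sample strings are made up for this example: one is a trimmed version of the file from the question, the other has a deliberately missing closing tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents for illustration only.
WELL_FORMED = "<root><singleElement><subElementX>XYZ</subElementX></singleElement></root>"
MALFORMED = "<root><singleElement><subElementX>XYZ</singleElement></root>"

root = ET.fromstring(WELL_FORMED)

# Tag names keep their original case, so lookups must match exactly.
exact = root.find("singleElement/subElementX")
wrong_case = root.find("singleelement/subelementx")
print(exact.text)
print(wrong_case)

# A real XML parser refuses input that is not well-formed.
try:
    ET.fromstring(MALFORMED)
    malformed_rejected = False
except ET.ParseError as err:
    malformed_rejected = True
    print("rejected:", err)
```

An HTML parser would quietly accept both inputs and normalise the names, which is convenient for messy web pages but hides exactly the kind of mistakes you want to catch in XML.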


 