
Parsing XML namespaces in Python 3 and Beautiful Soup 4

I am trying to parse XML with BS4 in Python 3.

For some reason, I cannot find the namespaced elements. I looked for answers in this question, but the suggestions there do not work for me, and I do not get an error message either.

Why does the first part work, but the second does not?

import requests
from bs4 import BeautifulSoup

input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
    <wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:region id="NA" iso2code="NA">Aggregates</wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
    <wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
    <wb:capitalCity />
    <wb:longitude />
    <wb:latitude />
  </wb:country>
</wb:countries>

<item>
  <title>Some string</title>
  <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
  <guid isPermaLink="false">4574785</guid>
  <link>https://somesite.com</link>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
  <itunes:image href="https://somesite.com/img.jpg"/>
  <itunes:duration>7845</itunes:duration>
  <itunes:explicit>no</itunes:explicit>
  <itunes:episodeType>Full</itunes:episodeType>
</item>
"""

soup = BeautifulSoup(input, 'xml')

# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)

# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)

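The 'xml' parser is strict: it expects well-formed XML in which every prefix is declared. In your input only xmlns:wb is declared, the itunes prefix is used without an xmlns:itunes declaration, and the string contains two top-level elements, so the second lookup fails. For this mixed input, use the lenient 'lxml' (HTML) parser instead, which treats prefixed names such as itunes:subtitle as ordinary tag names. Note that an HTML parser lowercases tag names, so mixed-case tags such as wb:iso2Code would have to be searched for in lowercase: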

soup = BeautifulSoup(input, 'lxml')

# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)

# Also working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)

Output

Aruba
Africa Eastern and Southern
A subtitle
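
If you would rather keep a strict XML parser, the other option is to declare the missing prefix. The sketch below illustrates this with the standard library's xml.etree.ElementTree instead of Beautiful Soup, so it is only an illustration of the strict-parser behaviour and not a drop-in replacement for the code above. The namespace URI for itunes is an assumption (it is the one podcast feeds commonly declare; it does not appear in the question), and the item fragment is trimmed to one child.

```python
import xml.etree.ElementTree as ET

# Trimmed <item> fragment from the question: the itunes prefix is never declared.
undeclared = "<item><itunes:subtitle>A subtitle</itunes:subtitle></item>"

try:
    ET.fromstring(undeclared)
    strict_error = None
except ET.ParseError as exc:
    # A strict parser rejects a prefix that has no xmlns declaration.
    strict_error = str(exc)

# Assumption: this URI is not in the question; it is the namespace that
# podcast feeds commonly declare for the itunes prefix.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"

# Same fragment, with the prefix declared on the enclosing element.
declared = (
    f'<item xmlns:itunes="{ITUNES_NS}">'
    "<itunes:subtitle>A subtitle</itunes:subtitle>"
    "</item>"
)

item = ET.fromstring(declared)
# ElementTree resolves the prefix through an explicit prefix-to-URI mapping.
subtitle = item.find("itunes:subtitle", {"itunes": ITUNES_NS}).text

print(strict_error)
print(subtitle)
```

The first print shows the parse error raised for the undeclared prefix; the second shows the subtitle text once the prefix is declared. In a real feed the declaration normally sits on the root element, so fixing the input at the source is usually cleaner than switching to an HTML parser.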
