Parsing XML namespaces in Python 3 and Beautiful Soup 4

Question

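I am trying to parse XML with BS4 in Python 3.

For some reason, I am not able to parse namespaces. I tried looking for answers in this question, but it doesn't work for me and I don't get any error message either.

Why does the first part work, while the second does not?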

import requests
from bs4 import BeautifulSoup

input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
    <wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:region id="NA" iso2code="NA">Aggregates</wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
    <wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
    <wb:capitalCity />
    <wb:longitude />
    <wb:latitude />
  </wb:country>
</wb:countries>

<item>
  <title>Some string</title>
  <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
  <guid isPermaLink="false">4574785</guid>
  <link>https://somesite.com</link>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
  <itunes:image href="https://somesite.com/img.jpg"/>
  <itunes:duration>7845</itunes:duration>
  <itunes:explicit>no</itunes:explicit>
  <itunes:episodeType>Full</itunes:episodeType>
</item>
"""

soup = BeautifulSoup(input, 'xml')

# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)

# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)

Answer 1

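Because the XML parser works in strict mode and expects the namespace to be declared: here the itunes prefix is used without any xmlns:itunes declaration. Use lxml instead to get your expected result in this wild mix. In BS4, 'lxml' selects lxml's lenient HTML parser, while 'xml' selects its XML parser. Note that the HTML parser lowercases tag names, so mixed-case tags such as wb:iso2Code have to be searched for in lowercase.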

soup = BeautifulSoup(input, 'lxml')

# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)

# Also working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)

Output

Aruba
Africa Eastern and Southern
A subtitle
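
To illustrate why a strict XML parser stumbles over this input, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than BS4. It is not the method from the answer above, only a way to make the strictness visible. The ITUNES_NS URI, the two_roots sample, and the helper subtitle_or_error are assumptions introduced for illustration; none of them appear in the question.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the itunes prefix; it does not appear in the question.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"

# As in the question: the itunes prefix is used but never declared.
undeclared = """<item>
  <title>Some string</title>
  <itunes:subtitle>A subtitle</itunes:subtitle>
</item>"""

# Same fragment with the prefix bound to a namespace on the root element.
declared = f"""<item xmlns:itunes="{ITUNES_NS}">
  <title>Some string</title>
  <itunes:subtitle>A subtitle</itunes:subtitle>
</item>"""

# Two top-level elements in one string, like the combined input in the question.
two_roots = "<a>first</a>\n<item><title>Some string</title></item>"


def subtitle_or_error(xml_text):
    """Return the itunes:subtitle text, or the parser's error message (hypothetical helper)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return f"ParseError: {exc}"
    # Map the prefix to its URI so the search is namespace-aware.
    node = root.find("itunes:subtitle", {"itunes": ITUNES_NS})
    return node.text if node is not None else None


print(subtitle_or_error(undeclared))
print(subtitle_or_error(declared))
print(subtitle_or_error(two_roots))
```

With a strict parser, the undeclared prefix and the second top-level element are both reported as parse errors, while the fragment with a declared prefix yields the subtitle text. If the feed can be fixed at the source, declaring the prefix and parsing each document separately keeps a namespace-aware XML parser usable.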
