Get a specific tag - BeautifulSoup

Question

Below is the XML that I'm trying to parse.

<url>
   <loc>https://www.houseofindya.com/aqua-chanderi-pleated-sharara-pants-177/iprdt</loc>
   <image:image>
      <image:loc>https://img.faballey.com/Images/Product/IPL00325Z/d3.jpg</image:loc>
      <image:title>Green Chanderi Pleated Sharara Pants</image:title>
   </image:image>
   <priority>0.8</priority>
   <changefreq>daily</changefreq>
</url>
<url>
   <loc>https://www.houseofindya.com/aqua-foil-chanderi-kurta-171/iprdt</loc>
   <image:image>
      <image:loc>https://img.faballey.com/Images/Product/ITN01710Z/d3.jpg</image:loc>
      <image:title>Aqua Foil Chanderi Kurta</image:title>
   </image:image>
   <priority>0.8</priority>
   <changefreq>daily</changefreq>
</url>

I need to get the text of only the <loc> tags, so I do the following:

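from bs4 import BeautifulSoup

def get_urls(xml):
    soup = BeautifulSoup(xml, features='xml')
    # Walk from one <loc> match to the next in document order
    loc = soup.find('loc')
    while loc is not None:
        url = loc.text
        yield url
        loc = loc.find_next('loc')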

The result I get is:

https://www.houseofindya.com/aqua-chanderi-pleated-sharara-pants-177/iprdt
https://img.faballey.com/Images/Product/IPL00325Z/d3.jpg
https://www.houseofindya.com/aqua-foil-chanderi-kurta-171/iprdt
https://img.faballey.com/Images/Product/ITN01710Z/d3.jpg

However, what I want is only https://www.houseofindya.com/aqua-chanderi-pleated-sharara-pants-177/iprdt and https://www.houseofindya.com/aqua-foil-chanderi-kurta-171/iprdt. I don't want the text of <image:loc>.

What am I missing here?

Answer 1

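You can use CSS selectors even in an XML document, so select all url > loc. The > combinator matches only <loc> tags that are direct children of <url>, which leaves out the <image:loc> nested inside <image:image>: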

from bs4 import BeautifulSoup

xml_doc = """
... your XML from question here ...
"""

soup = BeautifulSoup(xml_doc, "html.parser")

for loc in soup.select("url > loc"):
    print(loc.text)

Prints:

https://www.houseofindya.com/aqua-chanderi-pleated-sharara-pants-177/iprdt
https://www.houseofindya.com/aqua-foil-chanderi-kurta-171/iprdt
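If the URLs need to come out of a generator, as in the question, the same "direct children only" restriction can also be written with find_all and recursive=False. This is a minimal sketch rather than the answerer's code: iter_page_urls and SAMPLE_XML are hypothetical names, and the example.com URLs are placeholders standing in for the sitemap entries.

```python
from bs4 import BeautifulSoup

# Trimmed sample in the same shape as the XML from the question.
# The URLs are placeholders, not the real sitemap entries.
SAMPLE_XML = """
<url>
   <loc>https://example.com/page-1</loc>
   <image:image>
      <image:loc>https://example.com/img-1.jpg</image:loc>
   </image:image>
</url>
<url>
   <loc>https://example.com/page-2</loc>
   <image:image>
      <image:loc>https://example.com/img-2.jpg</image:loc>
   </image:image>
</url>
"""


def iter_page_urls(xml_text):
    """Yield the text of each <loc> that is a direct child of a <url>."""
    soup = BeautifulSoup(xml_text, "html.parser")
    for url_tag in soup.find_all("url"):
        # recursive=False limits the search to direct children of <url>,
        # so the <image:loc> nested inside <image:image> is never visited.
        loc = url_tag.find("loc", recursive=False)
        if loc is not None:
            yield loc.get_text(strip=True)


page_urls = list(iter_page_urls(SAMPLE_XML))
print(page_urls)
```

Unlike find_next, which keeps walking through the whole document and (as the output in the question shows) also lands on <image:loc>, this version only ever looks one level below each <url>.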
