Retrieving attribute values by iteration with BeautifulSoup

Question

I'm scraping an HTML document saved to a file with the following code:

from bs4 import BeautifulSoup as bs

path_xml = r"..."

content = []

with open(path_xml, "r") as file:
    content = file.readlines()

content = "".join(content)
bs_content = bs(content, "html.parser")

bilder = bs_content.find_all("bilder")

def get_str_bild(match):
    test = match.findChildren("b")

    for x in range(len(test)): # here is the problem (not giving me all elements in test)
 
        return test[x].get("d")

for b in bilder:
    if b.b: 
        print(get_str_bild(b))

Output:

L3357U00_002120.jpg
L3357U00_002140.jpg
L3357U00_002160.jpg

Basically, there are 3 places in the XML file where the node "bilder" has children. Each block looks like this:

<Bilder>
    <B Nr="1" D="L3357U00_002120.jpg"/>
    <B Nr="2" D="L3357U00_002120.jpg"/>
    <B Nr="3" D="L3357U00_002120.jpg"/>
    <B Nr="4" D="L3357U00_002120.jpg"/>
    <B Nr="9" D="L3357U00_002120.jpg"/>
    <B Nr="1" D="L3357U00_002130.jpg"/>
    <B Nr="2" D="L3357U00_002130.jpg"/>
    <B Nr="3" D="L3357U00_002130.jpg"/>
    <B Nr="4" D="L3357U00_002130.jpg"/>
    <B Nr="9" D="L3357U00_002130.jpg"/>
</Bilder>

Currently it only returns the first picture of each block, and I want it to return all of them.

What am I doing wrong here?

Answer 1

You need to fix the get_str_bild(match) function. It currently returns only the first d attribute, because the return statement sits inside the loop and ends the function on the first iteration.

Replace your function with this:
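The effect of a return inside a for loop can be seen in isolation, without BeautifulSoup. This is a minimal sketch in plain Python; the helper names first_only and all_items and the sample file names are hypothetical and do not come from the question:

```python
def first_only(items):
    # `return` exits the function during the first pass through the loop
    for item in items:
        return item


def all_items(items):
    # Collect every value, then return once the loop has finished
    collected = []
    for item in items:
        collected.append(item)
    return collected


names = ["a.jpg", "b.jpg", "c.jpg"]
print(first_only(names))
print(all_items(names))
```

The first call yields a single item, the second yields the whole list, which mirrors the difference between the question's function and a corrected one.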

def get_str_bild(match):
    test = match.find_all("b")
    
    elements = []
    for x in range(len(test)):
        elements.append(test[x].get("d"))

    return elements
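The same idea can be written more compactly with a list comprehension. The following is a sketch rather than a definitive version: it assumes the html.parser backend (which lowercases tag and attribute names, so "b" and "d" match <B D="...">), and the inline sample is a shortened excerpt of the block shown in the question:

```python
from bs4 import BeautifulSoup

# Shortened excerpt of the <Bilder> block from the question
sample = """
<Bilder>
    <B Nr="1" D="L3357U00_002120.jpg"/>
    <B Nr="2" D="L3357U00_002120.jpg"/>
    <B Nr="1" D="L3357U00_002130.jpg"/>
</Bilder>
"""


def get_str_bild(match):
    # Return the "d" attribute of every <b> element below `match`
    return [b.get("d") for b in match.find_all("b")]


soup = BeautifulSoup(sample, "html.parser")
for bilder in soup.find_all("bilder"):
    # The function now returns a list, so print its items one by one
    for name in get_str_bild(bilder):
        print(name)
```

Because the function returns a list, the calling loop has to iterate over the result (or print the list as a whole) instead of printing a single string.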

Answer 2

You're missing the inner loop over the b elements of each bilder block. You can remove your function and simplify your code as follows:

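pic_1 = "L3357U00_002120.jpg"

bs_content = bs(content, "html.parser")
for i, bilder in enumerate(bs_content.find_all("bilder")):
    print(f'bilder {i}')
    # Inner loop over every <b> child of this <bilder> block
    for b in bilder.find_all('b'):
        # Compare the file name in "d" (not "nr"); drop this check to print every picture
        if b['d'] == pic_1:
            print(b['d'])
            #break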
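If every picture of every block is wanted, the nested loops can collect the values per block without any filter. This is a sketch under stated assumptions: the <Root> wrapper and the variable name pictures_by_block are hypothetical, and the sample is a shortened stand-in for the real file that reuses file names appearing in the question:

```python
from bs4 import BeautifulSoup

# Two shortened blocks shaped like the question's file;
# the <Root> wrapper is an assumption for illustration only
sample = """
<Root>
  <Bilder>
    <B Nr="1" D="L3357U00_002120.jpg"/>
    <B Nr="1" D="L3357U00_002130.jpg"/>
  </Bilder>
  <Bilder>
    <B Nr="1" D="L3357U00_002140.jpg"/>
  </Bilder>
</Root>
"""

soup = BeautifulSoup(sample, "html.parser")

# Map each block index to the list of its "d" attribute values
pictures_by_block = {}
for i, bilder in enumerate(soup.find_all("bilder")):
    pictures_by_block[i] = [b["d"] for b in bilder.find_all("b")]

for i, names in pictures_by_block.items():
    print(i, names)
```

Keeping the values grouped by block index preserves which block each picture came from, which a flat list would lose.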

Answer 1
0 2023-01-28 13:54:52

Answer 2
0 Accepted 2023-01-28 14:04:43
