How can I find <img src> nested within <div> using Beautiful Soup?
New to both Python and Beautiful Soup. I am trying to collect the src of an img inserted into a collapsible section on an e-commerce site. The collapsible sections that contain the images have the class accordion__contents, but the <img> tags inserted into those sections do not have a specific class. Not every page contains an image; some contain multiple.
I am trying to extract the src from img tags that are randomly nested within <div>. In the HTML example below, my desired output would be: https://example.com/image1.png
<div class="accordion__title">Description</div>
<div class="accordion__contents">
<p>Enjoy Daiya’s Hon’y Mustard Dressing on your salads</p>
</div>
<div class="accordion__title">Ingredients</div>
<div class="accordion__contents">
<p>Non-GMO Expeller Pressed Canola Oil, Filtered Water</p>
<p><strong>CONTAINS: MUSTARD</strong></p>
</div>
<div class="accordion__title">Nutrition</div>
<div class="accordion__contents">
<p>
<img alt="" class="alignnone size-medium wp-image-57054" height="300" src="https://example.com/image1.png" width="162"/>
</p>
</div>
<div class="accordion__title">Warnings</div>
<div class="accordion__contents">
<p><strong>Contains mustard</strong></p>
</div>
I've written the following code, which successfully drills down to the full tag, but I can't figure out how to extract src once I'm there.
img_href = container.find_all(class_='accordion__contents')  # generates the output above, in list form
img_href = [img.find_all('img') for img in img_href]
for x in img_href:
    if len(x) == 0:  # skip over empty items in the list that don't have images
        continue
    else:
        print(x)  # print to make sure the image is there
        x.find('img')['src']  # generates error - see below
The error I am getting is:
ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
My intent is not to treat a list like a single item, hence the loop. I've tried find_all() combined with .attrs('src'), but that also didn't work. What am I doing wrong?
I've simplified my example, but the URL for the page I'm scraping is here.
You can use the CSS selector ".accordion__contents img":
import requests
from bs4 import BeautifulSoup
url = "https://gtfoitsvegan.com/product/hony-mustard-dressing-by-daiya/?v=7516fd43adaa"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_imgs = [img["src"] for img in soup.select(".accordion__contents img")]
print(all_imgs)
Prints:
['https://gtfoitsvegan.com/wp-content/uploads/2021/04/Daiya-Honey-Mustard-Nutrition-Facts-162x300.png']
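As for why the loop in the question fails: each x in img_href is itself the ResultSet returned by find_all('img'), that is, a list of tags, so it has no find() method. Looping over it a second time yields the individual <img> tags, and their src can then be read directly. Below is a minimal sketch of that fix, run against a trimmed copy of the HTML from the question rather than the live page. The names html, srcs and via_selector are only for illustration, and the explicit length check is dropped on the assumption that an empty result should simply be skipped.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, so no network access is needed.
html = """
<div class="accordion__title">Description</div>
<div class="accordion__contents">
<p>Enjoy the dressing on your salads</p>
</div>
<div class="accordion__title">Nutrition</div>
<div class="accordion__contents">
<p>
<img alt="" class="alignnone size-medium wp-image-57054" height="300" src="https://example.com/image1.png" width="162"/>
</p>
</div>
"""

container = BeautifulSoup(html, "html.parser")

srcs = []
for section in container.find_all(class_="accordion__contents"):
    # find_all() returns a ResultSet (a list of tags), so iterate over it
    # instead of calling find() on it. A section without images gives an
    # empty list, and the inner loop simply does not run.
    for img in section.find_all("img"):
        src = img.get("src")  # .get() returns None if the attribute is missing
        if src:
            srcs.append(src)

# The CSS selector from the answer above collects the same tags in one call.
via_selector = [img["src"] for img in container.select(".accordion__contents img")]

print(srcs)
print(via_selector)
```

Both lists should contain only https://example.com/image1.png for this snippet. Using img.get("src") instead of img["src"] is a defensive choice here: it avoids a KeyError if some <img> on a real page happens to lack a src attribute.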