Unable to scrape images with BeautifulSoup

Question

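I'm trying to scrape images from a marketplace site, but I think the odd class names are getting in my way. This is the HTML I'm trying to scrape:

When I run this snippet: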

import requests
from bs4 import BeautifulSoup
url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'

headers = {'User-Agent': 'whatever'}
response = requests.get(url, headers=headers)
html = response.content
bs = BeautifulSoup(html, "lxml")
bs.find('div', class_='src__Wrapper-xr9q25-1 fwzdjF')

I get this result: <div class="src__Wrapper-xr9q25-1 fwzdjF"></div>. The div is empty, so there is nothing more to scrape.

If I try to select the picture tag instead, nothing is returned:

>>> bs.find('picture', class_="src__Picture-xr9q25-2 gKwsnn")

Does anyone know what to do here?

Answer 1

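The images are loaded dynamically via JavaScript, so they are not present in the HTML that requests receives. The page does embed the data as JSON in an inline script, though, and you can extract it with the json and re modules, as in this example: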

import re
import json
import requests

url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
# The product data is embedded in the page as JSON assigned to window.__APOLLO_STATE__
html = requests.get(url, headers=headers).text
data = json.loads(re.search(r'window\.__APOLLO_STATE__ = (.*)</script>', html).group(1))


def find_images(data):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'images':
                yield v
            else:
                yield from find_images(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_images(v)


images = next(find_images(data))

for image in images:
    print(image['extraLarge'])

Prints:

https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_1SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_2SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_3SZ.jpg
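
The regular expression above relies on the JSON sitting on a single line and being followed immediately by </script>. As a hedged alternative, the sketch below locates the assignment and lets json.JSONDecoder.raw_decode parse exactly one JSON value from that position, so trailing characters such as a semicolon are ignored. SAMPLE_HTML, the example.com URLs, and the helper name extract_state are assumptions made for illustration; the real page is far larger and its layout may differ or change.

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for the served page: the wrapper div
# is empty and the image data lives in an inline script.
SAMPLE_HTML = """
<html><body>
<div class="src__Wrapper-xr9q25-1 fwzdjF"></div>
<script>window.__APOLLO_STATE__ = {"ROOT_QUERY": {"product": {"images": [{"extraLarge": "https://example.com/1.jpg"}, {"extraLarge": "https://example.com/2.jpg"}]}}};</script>
</body></html>
"""


def extract_state(html, marker="window.__APOLLO_STATE__"):
    """Return the JSON value assigned to `marker`, or None if it is absent."""
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # returns it together with the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


def find_images(data):
    """Yield every value stored under an 'images' key, at any depth."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == "images":
                yield value
            else:
                yield from find_images(value)
    elif isinstance(data, list):
        for value in data:
            yield from find_images(value)


state = extract_state(SAMPLE_HTML)
# Take the first 'images' list found; fall back to an empty list if none exists.
urls = [image["extraLarge"] for image in next(find_images(state), [])]
for image_url in urls:
    print(image_url)
```

To run this against the live page, pass requests.get(url, headers=headers).text to extract_state instead of SAMPLE_HTML. If the text after the marker is not valid JSON, raw_decode raises json.JSONDecodeError, which at least makes a change in page layout easy to notice.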

1 accepted, 2021-03-09 20:05:58