
Can't scrape image with BeautifulSoup

I'm trying to scrape an image from a marketplace, but I think the odd class names are getting in my way. This is the HTML I'm trying to scrape:

HTML

When I run this snippet:

import requests
from bs4 import BeautifulSoup
url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'

headers = {'User-Agent': 'whatever'}
response = requests.get(url, headers=headers)
html = response.content
bs = BeautifulSoup(html, "lxml")
bs.find('div', class_='src__Wrapper-xr9q25-1 fwzdjF')

I get this result: <div class="src__Wrapper-xr9q25-1 fwzdjF"></div>, with nothing inside it left to scrape.

If I try to scrape the picture tag instead, nothing is returned:

>>> bs.find('picture', class_="src__Picture-xr9q25-2 gKwsnn")

Does anyone know what to do here?

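The images are loaded dynamically via JavaScript, so they are not in the HTML that BeautifulSoup parses. You can still get them with the json and re modules, because the page embeds its data as JSON in a script tag, as in this example: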

import re
import json
import requests

url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
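
# The page assigns its data to window.__APOLLO_STATE__ as a JSON object inside a <script> tag
html = requests.get(url, headers=headers).text
match = re.search(r'window\.__APOLLO_STATE__ = (.*)</script>', html)
data = json.loads(match.group(1))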


def find_images(data):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'images':
                yield v
            else:
                yield from find_images(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_images(v)


images = next(find_images(data))

for image in images:
    print(image['extraLarge'])

Prints:

https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_1SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_2SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_3SZ.jpg
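
The same technique can be tried offline, which makes it easier to see what happens when the page layout changes. The following is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the real page, the example.com URLs are placeholders, and the helper names extract_state and first_image_urls are hypothetical. It returns an empty list instead of raising when the embedded JSON is missing or malformed.

```python
import json
import re

# Made-up stand-in for the downloaded page (an assumption, not the real site):
# the <div> that BeautifulSoup sees is empty, and the data sits in a <script> tag.
SAMPLE_HTML = """
<html><body>
<div class="src__Wrapper-xr9q25-1 fwzdjF"></div>
<script>window.__APOLLO_STATE__ = {"ROOT_QUERY": {"product": {"images": [{"extraLarge": "https://example.com/1.jpg"}, {"extraLarge": "https://example.com/2.jpg"}]}}}</script>
</body></html>
"""

STATE_RE = re.compile(r'window\.__APOLLO_STATE__ = (.*)</script>')


def extract_state(html):
    """Return the embedded JSON state as a dict, or None if absent or invalid."""
    match = STATE_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def find_images(data):
    """Recursively yield every value stored under an 'images' key."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'images':
                yield v
            else:
                yield from find_images(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_images(v)


def first_image_urls(html, size='extraLarge'):
    """Return the URLs of the first 'images' list found, or [] if there is none."""
    state = extract_state(html)
    if state is None:
        return []
    images = next(find_images(state), [])
    return [img[size] for img in images if isinstance(img, dict) and size in img]


print(first_image_urls(SAMPLE_HTML))
```

Passing a default to next() avoids a StopIteration when no 'images' key exists, which is the most likely way the original snippet would fail if the site changes its data layout.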
