Can't scrape image with BeautifulSoup
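I'm trying to scrape an image from a marketplace, but I think the odd class names are getting in my way. This is the HTML I'm trying to scrape:
When I run this snippet: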
import requests
from bs4 import BeautifulSoup
url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'
headers = {'User-Agent': 'whatever'}
response = requests.get(url, headers=headers)
html = response.content
bs = BeautifulSoup(html, "lxml")
bs.find('div', class_='src__Wrapper-xr9q25-1 fwzdjF')
I get this result: <div class="src__Wrapper-xr9q25-1 fwzdjF"></div>. The div is empty, so there is nothing more to scrape.
If I try to find the picture tag instead, nothing is returned:
>>> bs.find('picture', class_="src__Picture-xr9q25-2 gKwsnn")
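Does anyone know what to do here?
The image is loaded dynamically via JavaScript, so it is not in the HTML that requests receives. However, the page embeds its data as JSON in a script tag, and you can extract it with the json and re modules, as in this example: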
import re
import json
import requests
url = 'https://www.americanas.com.br/produto/134231584?pfm_carac=Aspirador%20de%20P%C3%B3%20Vertical&pfm_page=category&pfm_pos=grid&pfm_type=vit_product_grid&voltagem=110V'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
data = json.loads( re.search(r'window\.__APOLLO_STATE__ = (.*)</script>', requests.get(url, headers=headers).text ).group(1) )

def find_images(data):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'images':
                yield v
            else:
                yield from find_images(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_images(v)

images = next(find_images(data))
for image in images:
    print(image['extraLarge'])
Prints:
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_1SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_2SZ.jpg
https://images-americanas.b2w.io/produtos/01/00/img/134231/5/134231592_3SZ.jpg
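
The same technique can be tried without touching the live site. The following is a minimal offline sketch, not the answerer's code: SAMPLE_HTML is a made-up page, the example.com URLs are placeholders, and extract_state is a hypothetical helper name. It assumes the page assigns plain JSON to window.__APOLLO_STATE__ inside a script tag, as the real page did when the answer was written. Unlike the one-liner above, it returns None instead of raising when the marker is missing, which may happen if the site changes its markup.

```python
import json
import re

# Made-up page for illustration only: an empty wrapper div (what BeautifulSoup
# sees) plus the embedded JSON state that the browser would render from.
SAMPLE_HTML = (
    '<html><body>'
    '<div class="src__Wrapper-xr9q25-1 fwzdjF"></div>'
    '<script>window.__APOLLO_STATE__ = '
    '{"ROOT_QUERY": {"product": {"images": ['
    '{"extraLarge": "https://example.com/1.jpg"}, '
    '{"extraLarge": "https://example.com/2.jpg"}]}}}'
    '</script></body></html>'
)


def extract_state(html):
    """Return the JSON assigned to window.__APOLLO_STATE__, or None if absent."""
    # Non-greedy match so the capture stops at the first closing script tag.
    match = re.search(r'window\.__APOLLO_STATE__ = (.*?)</script>', html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


def find_images(data):
    """Recursively yield every value stored under an 'images' key."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == 'images':
                yield value
            else:
                yield from find_images(value)
    elif isinstance(data, list):
        for item in data:
            yield from find_images(item)


state = extract_state(SAMPLE_HTML)
# The default [] avoids StopIteration when no 'images' key exists.
images = next(find_images(state), [])
for image in images:
    print(image['extraLarge'])
```

To use it on a real page, pass requests.get(url, headers=headers).text to extract_state in place of SAMPLE_HTML and check for None before reading the images.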