Error when scraping images with BeautifulSoup

Question

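The original code is here: https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py

I'm trying to adapt a Python script that collects pictures from a website, so that I can get better at web scraping.

I tried to get the images from "https://500px.com/editors".

The first error was:

The code that caused this warning is on line 12 of the file /Bureau/scrapper.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

So I did: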

soup = BeautifulSoup(plain_text, features="lxml")

I also adjusted the class to reflect the tags used on 500px.

But now the script stops running and nothing happens.

In the end it looks like this:

import requests 
from bs4 import BeautifulSoup 
import urllib.request
import random 

url = "https://500px.com/editors"

source_code = requests.get(url)

plain_text = source_code.text

soup = BeautifulSoup(plain_text, features="lxml")

for link in soup.find_all("a",{"class":"photo_link "}):
    href = link.get('href')
    print(href)

    img_name = random.randrange(1,500)

    full_name = str(img_name) + ".jpg"

    urllib.request.urlretrieve(href, full_name)

    print("loop break")

What am I doing wrong?

Answer 1

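Actually, the site loads its content through JavaScript, via an XHR request to the following API.

So you can access it directly through the API.

Note that you can increase the rpp=50 parameter to any number you need in order to get more than 50 results.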

import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['url'])
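The note about rpp can be made concrete by building the query string from its parts instead of editing the long URL by hand. The following is a minimal sketch, not the site's documented usage: the helper name build_photos_url is hypothetical, only a subset of the parameters from the URL above is included, and it assumes the API still responds when the other parameters are left out.

```python
from urllib.parse import urlencode

API_BASE = "https://api.500px.com/v1/photos"


def build_photos_url(rpp=50, page=1, sizes=(1, 2, 2048)):
    """Build the photos API URL with a configurable page size (rpp).

    Hypothetical helper: it keeps only a subset of the query parameters
    seen in the answer's URL, which is an assumption about what the API needs.
    """
    query = [("feature", "editors"), ("rpp", rpp), ("page", page)]
    # Repeat the image_size[] key once per requested size, as the original URL does.
    query += [("image_size[]", size) for size in sizes]
    return f"{API_BASE}?{urlencode(query)}"


# Ask for 100 results per page instead of 50.
print(build_photos_url(rpp=100))
```

The resulting string can be passed to requests.get(...) in place of the hard-coded URL.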

You can also access the image URL itself and write it out directly!

import requests

r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()

for item in r['photos']:
    print(item['image_url'][-1])

Note that the image_url key holds the different image sizes, so you can pick whichever one you prefer and save it. Here I took the large one.

Saving directly:

import requests

with requests.Session() as req:
    r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
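    for item in r['photos']:
        print(f"Downloading {item['name']}")
        # The last entry of image_url is the one picked here as the large size.
        save = req.get(item['image_url'][-1])
        # Assumes the header value starts with "filename=" (9 characters);
        # this raises TypeError if the header is missing.
        name = save.headers.get("Content-Disposition")[9:]
        with open(name, 'wb') as f:
            f.write(save.content)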
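Slicing the Content-Disposition header with [9:] only works while the header is present and starts with "filename=". As a hedged alternative, the sketch below parses the header with the standard library and falls back to a caller-supplied name. The function name filename_from_headers and the fallback argument are hypothetical, and nothing here is confirmed about what 500px actually sends.

```python
from email.message import Message


def filename_from_headers(headers, fallback):
    """Return the filename from a Content-Disposition header, or the fallback.

    Hypothetical helper: `headers` is any mapping with .get(), such as
    the headers attribute of a requests response.
    """
    value = headers.get("Content-Disposition")
    if not value:
        return fallback
    # Let the standard library parse the header parameters.
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename() or fallback


print(filename_from_headers({"Content-Disposition": 'attachment; filename="photo.jpg"'}, "image.jpg"))
print(filename_from_headers({}, "image.jpg"))
```

In the loop above, this would replace the slicing line, for example with a fallback built from the photo's id if the response includes one.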

Answer 2

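Looking at the page you're trying to scrape, I noticed something. The data doesn't appear until a few moments after the page has finished loading. That tells me they're using a JS framework to load the images after the page loads.

Your scraper won't work with this page because it doesn't run the JS on the pages it pulls. Running your script and printing out what plain_text contains proves this: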

<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>

If you look at the href attribute on that tag, you'll see that it is actually a template tag used by a JS UI framework.
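A quick way to confirm this from a script is to collect the href values and check whether they are still unrendered placeholders. The sketch below uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The names LinkCollector and is_template_placeholder are hypothetical, and treating any value that contains "{{" and "}}" as a template tag is an assumption about the framework's syntax.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def is_template_placeholder(href):
    # Assumption: unrendered template tags look like {{name}}.
    return "{{" in href and "}}" in href


# The tag printed from plain_text in this answer.
snippet = "<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>"

collector = LinkCollector()
collector.feed(snippet)
for href in collector.hrefs:
    print(href, is_template_placeholder(href))
```

If every collected href is a placeholder, the real URLs are being filled in by JavaScript after the page loads, and they cannot be downloaded from the raw HTML.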

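Your options now are either to find out which APIs they are calling to get this data (check the inspector in your web browser for network calls; if you're lucky, they may not require authentication) or to use a tool that runs the JS on the page. One tool I've seen recommended for this is selenium, though I've never used it, so I'm not fully aware of its capabilities; I imagine the tooling around it would greatly increase the complexity of what you're trying to do.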

2 answers

Answer 1
2 Accepted 2020-01-07 17:08:35

Answer 2
0 2020-01-07 17:05:07
