Scraping web content from a dynamically loaded page (infinite scroll)

Question

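I am trying to collect all image file names from this website: https://www.shipspotting.com/

I have already collected all category names and their ID numbers in a Python dict, cat_dict. My strategy is to iterate over every category page, call the data-loading API, and save the response for each page.

I have identified https://www.shipspotting.com/ssapi/gallery-search as the request URL that loads the next page of content. However, when I request this URL with the requests library, I get a 404. What do I need to do to get the correct response when loading the next page of content?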

import requests
from bs4 import BeautifulSoup

cat_page = 'https://www.shipspotting.com/photos/gallery?category='

for cat in cat_dict:
    cat_link = cat_page + str(cat_dict[cat])
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
        "Referer": cat_link,
    }

    response = requests.get('https://www.shipspotting.com/ssapi/gallery-search', headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

https://www.shipspotting.com/photos/gallery?category=169 is an example page (cat_link).

Answer 1

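Every time you scroll down the page, a new request is sent to the server (a POST request with a specific payload). You can verify this in the Network tab of the browser's developer tools.

The following works: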

import requests

# Put the following code in a for loop over the number of pages
# ([total number of ship photos] / 12), or make it async; your choice.

data = {"category": "", "perPage": 12, "page": 2}
r = requests.post('https://www.shipspotting.com/ssapi/gallery-search', data=data)
print(r.json())

This returns a JSON response:

{'page': 1, 'items': [{'lid': 3444123, 'cid': 172, 'title': 'ELLBING II', 'imo_no': '0000000',....}
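
The suggestion to loop over the pages can be sketched as follows. This is only a minimal sketch, not a confirmed description of the site's API: the helper names (page_count, fetch_page, iter_items) are made up for illustration, the payload is sent as JSON, and it assumes that a page past the end of the gallery comes back with an empty 'items' list.

```python
import math
from typing import Callable, Dict, Iterator

API_URL = "https://www.shipspotting.com/ssapi/gallery-search"


def page_count(total_items: int, per_page: int = 12) -> int:
    """Number of pages needed to cover total_items (the last page may be partial)."""
    return math.ceil(total_items / per_page)


def fetch_page(page: int, per_page: int = 12) -> Dict:
    """POST one page request to the gallery API and return the decoded JSON."""
    import requests  # imported here so the pure helpers work without requests installed

    payload = {"category": "", "perPage": per_page, "page": page}
    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


def iter_items(fetch: Callable[[int], Dict], max_pages: int) -> Iterator[Dict]:
    """Yield items page by page, stopping early at the first empty page.

    ASSUMPTION: an empty or missing 'items' list means there are no more pages.
    """
    for page in range(1, max_pages + 1):
        items = fetch(page).get("items", [])
        if not items:
            break
        yield from items
```

Usage might look like `for item in iter_items(fetch_page, max_pages=page_count(120)): print(item['title'])`, where 120 stands in for whatever total the site actually reports. Passing the fetch function as an argument keeps the paging logic testable without touching the network.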

Answer 2

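The URL you identified is an API endpoint that expects its data to be sent with the POST method. With infinite scroll, loading the "next page" means paginating through the API by changing the page number in the payload.

Working code, which gets a 200 response: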

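import requests

api_url = 'https://www.shipspotting.com/ssapi/gallery-search'
headers = {'content-type': 'application/json'}
payload = {"category": "", "perPage": 12, "page": 1}

# Request pages 1 through 6, updating the page number in the payload each time.
for page in range(1, 7):
    payload['page'] = page
    # json= serializes the payload as a JSON request body.
    res = requests.post(api_url, headers=headers, json=payload)
    for item in res.json()['items']:
        title = item['title']
        print(title)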
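
To tie this back to the question's per-category loop, one possible extension is sketched below. It rests on an unverified assumption: that the payload's "category" field accepts the same ID that appears in the gallery URL (?category=169). The helper names (build_payload, post_json, titles_by_category) are hypothetical and not part of any library.

```python
from typing import Callable, Dict, List

API_URL = "https://www.shipspotting.com/ssapi/gallery-search"


def build_payload(category_id, page: int, per_page: int = 12) -> Dict:
    """Build the POST body for one page of one category.

    ASSUMPTION: "category" takes the ID from ?category=..., sent as a string
    to match the empty-string default seen in the working payload.
    """
    return {"category": str(category_id), "perPage": per_page, "page": page}


def post_json(payload: Dict) -> Dict:
    """Send the payload to the gallery API and return the decoded JSON."""
    import requests  # imported here so the pure helpers work without requests installed

    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()


def titles_by_category(
    cat_dict: Dict[str, int],
    pages: int,
    post: Callable[[Dict], Dict] = post_json,
) -> Dict[str, List[str]]:
    """Collect the 'title' of every item on the first `pages` pages of each category."""
    result: Dict[str, List[str]] = {}
    for name, cat_id in cat_dict.items():
        titles: List[str] = []
        for page in range(1, pages + 1):
            data = post(build_payload(cat_id, page))
            titles.extend(item["title"] for item in data.get("items", []))
        result[name] = titles
    return result
```

Calling `titles_by_category(cat_dict, pages=3)` would then return a dict keyed by category name. If the API turns out to ignore or reject the category value, only build_payload needs to change.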

Answer 1
1, accepted, 2022-07-11 20:45:27

Answer 2
0, 2022-07-11 20:58:58
