
Scrape web content from dynamically loaded page (infinite scroll)

I am trying to collect all image filenames from this website: https://www.shipspotting.com/

I have already collected a Python dict cat_dict of all the category names and their ID numbers. So my strategy is to iterate through every category page, call the data-loading API, and save its response for every page.

I have identified https://www.shipspotting.com/ssapi/gallery-search as the request URL that loads the next page of content. However, when I request this URL with the requests library, I get a 404. What do I need to do to obtain the correct response when loading the next page of content?

import requests
from bs4 import BeautifulSoup

cat_page = 'https://www.shipspotting.com/photos/gallery?category='

for cat in cat_dict:
    cat_link = cat_page + str(cat_dict[cat])
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
        "Referer": cat_link,
    }

    response = requests.get('https://www.shipspotting.com/ssapi/gallery-search', headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

https://www.shipspotting.com/photos/gallery?category=169 is an example page (cat_link).

Every time you scroll the page down, a new request is made to the server (a POST request with a certain payload). You can verify this in Dev tools, in the Network tab.

The following works:

import requests
from bs4 import BeautifulSoup

### put the following code in a for loop based on a number of pages 
### [total number of ship photos]/[12], or async it ... your choice

data = {"category":"","perPage":12,"page":2} 
r = requests.post('https://www.shipspotting.com/ssapi/gallery-search', data = data)
print(r.json())

This returns a JSON response:

{'page': 1, 'items': [{'lid': 3444123, 'cid': 172, 'title': 'ELLBING II', 'imo_no': '0000000',....}
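
Building on the comment in the snippet above, here is a minimal paging sketch that keeps requesting pages until the API stops returning items. The empty-'items' stopping condition is an assumption based on the response shape shown, not something the site documents:

import requests

api_url = 'https://www.shipspotting.com/ssapi/gallery-search'
page = 1
all_items = []

while True:
    # request the next page of gallery results
    r = requests.post(api_url, data={"category": "", "perPage": 12, "page": page})
    items = r.json().get('items', [])
    if not items:  # assumption: an empty 'items' list means there are no more pages
        break
    all_items.extend(items)
    page += 1

print(len(all_items), 'items collected')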

The URL you identified serves its data through an API that expects a POST request, and the infinite scroll loads the next page by incrementing the page number in the API's payload.

Working code, which returns a 200 response:

import requests

api_url = 'https://www.shipspotting.com/ssapi/gallery-search'
headers = {'content-type': 'application/json'}
payload = {"category": "", "perPage": 12, "page": 1}

for page in range(1, 7):
    payload['page'] = page
    res = requests.post(api_url, headers=headers, json=payload)
    for item in res.json()['items']:
        title = item['title']
        print(title)
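
To connect this back to the cat_dict strategy from the question, here is a sketch that loops over the categories and saves each page's items to disk. Passing the category id in the payload's "category" field is an assumption based on the gallery URL's category parameter, and the empty-'items' stopping condition is likewise assumed rather than documented:

import json
import requests

api_url = 'https://www.shipspotting.com/ssapi/gallery-search'
headers = {'content-type': 'application/json'}

# cat_dict maps category names to their id numbers, as described in the question
cat_dict = {'Tugs': 169}  # example entry; use the full dict you already collected

for cat_name, cat_id in cat_dict.items():
    page = 1
    while True:
        # assumption: the API filters by category when the id is sent in 'category'
        payload = {"category": str(cat_id), "perPage": 12, "page": page}
        res = requests.post(api_url, headers=headers, json=payload)
        items = res.json().get('items', [])
        if not items:  # assumption: empty 'items' means the category is exhausted
            break
        with open(f'{cat_name}_page_{page}.json', 'w') as f:
            json.dump(items, f)
        page += 1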

