
Scrape web content from dynamically loaded page (infinite scroll)

I am trying to collect all image filenames from this website: https://www.shipspotting.com/

I have already collected a Python dict, cat_dict, of all the category names and their ID numbers. My strategy is to iterate through every category page, call the data-loading API, and save its response for every page.

I have identified https://www.shipspotting.com/ssapi/gallery-search as the request URL which loads the next page of content. However, when I request this URL with the requests library, I get a 404. What do I need to do to obtain the correct response in loading the next page of content?

import requests
from bs4 import BeautifulSoup

cat_page = 'https://www.shipspotting.com/photos/gallery?category='

for cat in cat_dict:
    cat_link = cat_page + str(cat_dict[cat])
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
        "Referer": cat_link,
    }

    response = requests.get('https://www.shipspotting.com/ssapi/gallery-search', headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

https://www.shipspotting.com/photos/gallery?category=169 is an example page ( cat_link )

Every time you scroll down the page, a new request is made to the server (a POST request with a specific payload). You can verify this in the Network tab of your browser's dev tools.

The following works:

import requests
from bs4 import BeautifulSoup

# Put the request below in a for loop over the page numbers, where
# number of pages = (total number of ship photos) / 12, rounded up.
# Alternatively, send the requests asynchronously; your choice.

data = {"category": "", "perPage": 12, "page": 2}
r = requests.post('https://www.shipspotting.com/ssapi/gallery-search', data=data)
print(r.json())

This returns a JSON response:

{'page': 1, 'items': [{'lid': 3444123, 'cid': 172, 'title': 'ELLBING II', 'imo_no': '0000000',....}
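
The looping hinted at in the code comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_page`, `page_count`, and `iter_items` are hypothetical helper names, and the stop condition assumes the endpoint returns an empty `items` list once you request a page past the last one, which the answer above does not confirm.

```python
import math
from typing import Callable, Dict, Iterator, Optional

API_URL = "https://www.shipspotting.com/ssapi/gallery-search"


def fetch_page(page: int, category: str = "", per_page: int = 12) -> Dict:
    """POST one page request to the endpoint and return the decoded JSON."""
    import requests  # imported here so the helpers below need no network library

    payload = {"category": category, "perPage": per_page, "page": page}
    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


def page_count(total_items: int, per_page: int = 12) -> int:
    """Number of pages needed for total_items, rounding up for a partial last page."""
    return math.ceil(total_items / per_page)


def iter_items(get_page: Callable[[int], Dict],
               max_pages: Optional[int] = None) -> Iterator[Dict]:
    """Yield items page by page, starting at page 1.

    ASSUMPTION: an empty or missing "items" list means there are no more pages.
    max_pages is an optional safety cap in case that assumption does not hold.
    """
    page = 1
    while max_pages is None or page <= max_pages:
        items = get_page(page).get("items") or []
        if not items:
            break
        yield from items
        page += 1


# Example use (performs real HTTP requests, so it is left commented out):
# for item in iter_items(fetch_page, max_pages=5):
#     print(item["title"])
```

Passing the page-fetching function in as an argument keeps the pagination logic separate from the HTTP call, so it can be tried against canned responses before hitting the real site.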

The URL you identified serves its data through an API that expects a POST request. The infinite scroll loads each subsequent page by changing the page value in the API's payload, so pagination is controlled entirely by that payload.

Working code, which returns a 200 response:

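import requests

api_url = 'https://www.shipspotting.com/ssapi/gallery-search'
headers = {'content-type': 'application/json'}
payload = {"category": "", "perPage": 12, "page": 1}

# Request pages 1 through 6, updating the page number in the payload each time
for page in range(1, 7):
    payload['page'] = page
    res = requests.post(api_url, headers=headers, json=payload)
    # Each response carries the page's entries under the 'items' key
    for item in res.json()['items']:
        title = item['title']
        print(title)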
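
To connect this back to the question's plan of walking through `cat_dict`, a per-category loop might look like the sketch below. It is a guess in several places: that the category ID belongs in the payload's `category` field (the answers only ever send an empty string there), that it should be sent as a string, and that an empty `items` list marks the last page. `post_json` and `titles_by_category` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

API_URL = "https://www.shipspotting.com/ssapi/gallery-search"


def post_json(payload: Dict) -> Dict:
    """Send one POST request with a JSON body and return the decoded response."""
    import requests  # lazy import: only needed when actually hitting the network

    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()


def titles_by_category(cat_dict: Dict[str, int],
                       send: Callable[[Dict], Dict] = post_json,
                       max_pages: int = 6,
                       per_page: int = 12) -> Dict[str, List[str]]:
    """Collect item titles for every category name in cat_dict."""
    result: Dict[str, List[str]] = {}
    for name, cat_id in cat_dict.items():
        titles: List[str] = []
        for page in range(1, max_pages + 1):
            # ASSUMPTION: the category ID goes in the "category" field, as a string
            payload = {"category": str(cat_id), "perPage": per_page, "page": page}
            items = send(payload).get("items") or []
            if not items:
                break  # ASSUMPTION: no items means this category is exhausted
            titles.extend(item["title"] for item in items)
        result[name] = titles
    return result


# Example use (performs real HTTP requests, so it is left commented out);
# 169 is the category ID from the example page in the question:
# print(titles_by_category({"example": 169}, max_pages=2))
```

If the endpoint turns out to expect the ID in another form, only the payload line needs to change. Comparing it with the payload shown in the browser's Network tab for a category page would settle that.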



    
