How can I scrape next-page data with Python if the next page is loaded by JavaScript and the URL does not change?
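I am trying to scrape a web page with Python. I have successfully scraped the first page, but I cannot go to the next page, because the next page has the same URL and is loaded by JavaScript.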
import requests
import bs4 as bs

url = 'https://scamalert.sg/scam-details'
r = requests.get(url)
htmlcontent = r.content
soup = bs.BeautifulSoup(htmlcontent, 'html.parser')
for tag in soup.find_all('h4', {"class": "card-title"}):
    print(tag.text)
[Website HTML][1]
[1]: https://i.stack.imgur.com/8zV9y.png
<a class="page-link" href="javascript:void(0)" onclick="pagingOnClick('2')">2</a>
Here is one way to fetch all the stories, together with the links to their detail pages, while traversing every next page of the site. If you check the Chrome dev tools, you will notice that HTTP POST requests are made to this URL: https://scamalert.sg/scam-details/GetStoryListAjax/
The requests carry the appropriate parameters and return JSON content from which you can extract the desired fields.
import json
import requests

base = 'https://scamalert.sg{}'
link = 'https://scamalert.sg/scam-details/GetStoryListAjax/'

# Form parameters sent with every POST; 'page' is added inside the loop.
payload = {
    'scamType': '',
    'year': '',
    'month': '',
    'sortBy': 'Latest'
}

page = 1
while True:
    payload['page'] = page
    r = requests.post(link, data=payload)
    # 'result' is itself a JSON-encoded string, so it is decoded a second time.
    items = json.loads(r.json()['result'])['StoryList']
    # Stop once a page comes back with at most one item.
    if len(items) <= 1:
        break
    for item in items:
        print(item['Title'], base.format(item['Url']))
    page += 1
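If you want to reuse or test the loop above without hitting the site, one option is to separate the decoding from the network call. The following is only a sketch under a few assumptions: the names parse_stories, fetch_page and iter_stories are hypothetical helpers, not part of the site or of requests, and the max_pages cap and the timeout value are arbitrary safety limits that the original answer does not use. The endpoint, the payload keys and the stop condition are taken from the code above.

```python
import json

BASE = 'https://scamalert.sg{}'
LINK = 'https://scamalert.sg/scam-details/GetStoryListAjax/'


def parse_stories(response_json):
    """Return the story list from one decoded response body.

    As in the answer above, 'result' is assumed to hold a JSON-encoded
    string, so it needs a second decode.
    """
    return json.loads(response_json['result'])['StoryList']


def fetch_page(page):
    """POST one page request and return the decoded JSON body."""
    # Imported here so the parsing helpers work without the network dependency.
    import requests

    payload = {'scamType': '', 'year': '', 'month': '',
               'sortBy': 'Latest', 'page': page}
    r = requests.post(LINK, data=payload, timeout=30)  # timeout is an assumption
    r.raise_for_status()
    return r.json()


def iter_stories(fetch=fetch_page, max_pages=50):
    """Yield (title, absolute_url) pairs page by page.

    `fetch` can be swapped for a stub in tests; `max_pages` is a
    hypothetical cap to avoid looping forever if the site changes.
    """
    for page in range(1, max_pages + 1):
        items = parse_stories(fetch(page))
        if len(items) <= 1:  # same stop condition as the answer above
            break
        for item in items:
            yield item['Title'], BASE.format(item['Url'])
```

Usage would then be a plain loop such as "for title, url in iter_stories(): print(title, url)". Because fetch is a parameter, the pagination logic can be checked against canned responses before running it against the live site.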