
Save a whole web page instead of basic HTML with Python requests for scraping

So I want to use Beautiful Soup to scrape this page: https://www.nseindia.com/option-chain#optionchain_equity and I access it using the requests module. But I guess requests saves only the basic HTML, not the main table on that page. Using Chrome to download "Webpage, Complete" works, but how can I automate that in Python? Also, without those headers the request times out, so I guess they are necessary. Code:

import requests

# The "#optionchain_equity" fragment is handled by the browser; it is not sent to the server.
url = "https://www.nseindia.com/option-chain#optionchain_equity"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/80.0.3987.149 Safari/537.36',
           'accept-language': 'en,gu;q=0.9,hi;q=0.8', 'accept-encoding': 'gzip, deflate, br'}
response = requests.get(url, headers=headers, timeout=5)
# Use a context manager so the file is flushed and closed after writing.
with open("nse.html", "w", encoding="utf-8") as file:
    file.write(response.text)

If you are mainly looking for the table data, that data is loaded via an AJAX call.

The following script saves that data to a JSON file.

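import json

import requests

headers = {'user-agent':"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}

# The option-chain table is filled from this JSON endpoint, not from the page HTML.
res = requests.get("https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY", headers=headers, timeout=10)
res.raise_for_status()  # stop here if the request was rejected or failed

with open("data.json", "w") as f:
    json.dump(res.json(), f)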
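
Once data.json exists, the nested JSON can be flattened into table-like rows. The sketch below is only an illustration: the key names (`records`, `data`, `strikePrice`, `expiryDate`, `CE`, `PE`, `lastPrice`, `openInterest`) are assumptions about the payload layout and should be checked against the file you actually saved, and `flatten_option_chain` and `load_rows` are hypothetical helper names.

```python
import json


def flatten_option_chain(payload):
    """Return one flat dict per (strike, option type) found in the payload."""
    rows = []
    # ASSUMPTION: entries live under payload["records"]["data"], and each entry
    # has "strikePrice", "expiryDate" and optional "CE" / "PE" sub-dicts.
    for entry in payload.get("records", {}).get("data", []):
        for side in ("CE", "PE"):
            quote = entry.get(side)
            if quote is None:
                continue  # this strike has no call (or put) quote
            rows.append({
                "strike": entry.get("strikePrice"),
                "expiry": entry.get("expiryDate"),
                "type": side,
                "last_price": quote.get("lastPrice"),
                "open_interest": quote.get("openInterest"),
            })
    return rows


def load_rows(path="data.json"):
    """Read the file written by the script above and flatten it."""
    with open(path, encoding="utf-8") as f:
        return flatten_option_chain(json.load(f))


# Hypothetical sample with the assumed shape, so the sketch runs offline.
sample = {
    "records": {
        "data": [
            {"strikePrice": 10000, "expiryDate": "30-Jul-2020",
             "CE": {"lastPrice": 1.5, "openInterest": 10},
             "PE": {"lastPrice": 2.5, "openInterest": 20}},
            {"strikePrice": 10100, "expiryDate": "30-Jul-2020",
             "PE": {"lastPrice": 3.0, "openInterest": 5}},
        ]
    }
}
rows = flatten_option_chain(sample)
```

With the real file, `load_rows("data.json")` would return the same kind of list, which can then be written to CSV or handed to pandas.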

If you want to save a whole web page, you may try a headless Chrome API, something like this:

Download file through Google Chrome in headless mode

To interpret a web page, plain Python requests won't help: it just treats the response as a file stream. What you want is both the file download and the web browser behavior, so a headless Chrome API is the way to go.
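
As a rough sketch of the headless Chrome idea, Chrome can be started from Python with its `--headless` and `--dump-dom` command-line flags, which print the DOM after the page has loaded and its scripts have run. The binary name `google-chrome`, the helper names, and the timeout are assumptions; the binary may be called `chrome` or `chromium` on your system, data that arrives late via AJAX may still be missing, and the site may refuse headless browsers.

```python
import shutil
import subprocess


def build_dump_dom_command(url, chrome_binary="google-chrome"):
    """Build the command line that makes headless Chrome print the rendered DOM."""
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def save_rendered_page(url, out_path, chrome_binary="google-chrome", timeout=60):
    """Run headless Chrome and write the DOM it prints to out_path."""
    # ASSUMPTION: the Chrome executable is on PATH under the given name.
    if shutil.which(chrome_binary) is None:
        raise FileNotFoundError(f"{chrome_binary!r} was not found on PATH")
    result = subprocess.run(
        build_dump_dom_command(url, chrome_binary),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        timeout=timeout,
        check=True,  # raise CalledProcessError if Chrome exits with an error
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(result.stdout)
    return out_path
```

Calling `save_rendered_page("https://www.nseindia.com/option-chain", "nse_rendered.html")` would then produce a file that Beautiful Soup can parse, provided the page renders in headless mode.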

