Using Python to save a webpage as a file

I am new to Python, and at work I am trying to export some historical data. What I am trying to do is save hundreds of URL links as individual PDFs so we don't have to click and save each one by one. The URLs are direct links to forms that I would like to download. The webpage also has username/password authentication. I can't seem to get Python to export the URL link in any format; at first it seemed as if the webpage was not allowing me access because of the username/password, but after I added the requests.get and auth piece, the script seems to run but no export is created.

One of the commenters suggested pywebcopy, so I tried it. The tool successfully creates a folder and an HTML file in the destination, with the file name taken from the URL, but the file itself is blank. I added the authentication piece, but it made no difference: the saved HTML file is still blank.


import os

import requests

requests.get('main website url', auth=('username', 'password'))

urls = ['url1', 'url2', 'url3']  # etc.

output_dir = 'folder on my drive'

for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        file_path = os.path.join(output_dir, os.path.basename(url))
        with open(file_path, 'wb') as f:
            f.write(response.content)

This is the HTTP response:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): apps.bell.com:443
send: b'GET / HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 302 \r\n'
header: Set-Cookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD; Path=/; Secure; HttpOnly
header: Location: https://apps.bell.com/apps/bell/bell.bellmain
header: Content-Type: text/html;charset=ISO-8859-1
header: Content-Length: 0
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET / HTTP/1.1" 302 0
send: b'GET /apps/bell/bell.bellmain HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nCookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD\r\n\r\n'
reply: 'HTTP/1.1 401 \r\n'
header: WWW-Authenticate: Basic realm="bellProduction System - V8MU.Q3"
header: Content-Type: text/html
header: Content-Length: 522
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET /apps/bell/bell.bellmain HTTP/1.1" 401 522

The auth keyword argument to requests.get expects an authentication object. For convenience, if passed a tuple, it acts as though it was asked to do HTTP Basic Authentication. This authentication mechanism is not stateful, so you have to pass the auth parameter to every get call.
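As a rough sketch of what passing auth on every call could look like for the download loop in the question: the helper name save_urls and its get parameter are made up for illustration (get only exists so the function can be exercised without a network), and it assumes every URL ends in a usable file name.

```python
import os

import requests


def save_urls(urls, output_dir, auth, get=requests.get):
    """Download each URL with HTTP Basic Auth and save the body in output_dir.

    save_urls and the injectable `get` argument are hypothetical helpers,
    not part of the requests library.
    """
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    for url in urls:
        # Basic Auth is stateless, so the credentials go with every request.
        response = get(url, auth=auth)
        if response.status_code != 200:
            # Report failures instead of silently skipping them.
            print(f"skipped {url}: HTTP {response.status_code}")
            continue
        # Assumes the last path segment of the URL is a usable file name.
        file_path = os.path.join(output_dir, os.path.basename(url))
        with open(file_path, "wb") as f:
            f.write(response.content)
        saved.append(file_path)
    return saved
```

It would be called as save_urls(urls, output_dir, ('username', 'password')). Printing the status code for non-200 responses makes a 401 like the one in the log above visible rather than leaving the script to finish with no output.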

You might be saying: "But I don't have to do that in my browser". And that's correct. Most web browsers these days (definitely Firefox and Chrome, as I can personally attest) will remember HTTP Basic Auth credentials for websites you've been to and automatically send them if asked again for the same site, so you don't see the same prompt a bunch of times. But that's something your web browser does, not something the server does. So when you're making HTTP requests by hand, you're responsible for doing the same.
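One way to get that browser-like behaviour with requests is a Session object, which can hold the credentials and attach them to each request it sends. This is a minimal sketch, not the only approach; the function name make_session is hypothetical, while requests.Session and its auth attribute are part of the library.

```python
import requests


def make_session(username, password):
    """Return a Session that sends HTTP Basic Auth credentials on every request.

    make_session is a hypothetical helper name used for illustration.
    """
    session = requests.Session()
    # Stored once here, then applied to each request made through the session.
    session.auth = (username, password)
    return session
```

With this, session = make_session('username', 'password') followed by session.get(url) inside the loop takes the place of passing auth=... to each requests.get call.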
