
Using Python to save a webpage as a file

I am new to Python, and at work I am trying to export some historical data. I want to save hundreds of URLs as individual PDFs so that we don't have to open and save each one by hand. The URLs are direct links to forms that I would like to download, and the site requires username/password authentication. I can't get Python to export a URL in any format. At first it seemed that the site was denying me access because of the username/password, but after I added the requests.get call with auth, the script seems to run, yet no file is created.

One of the commenters suggested pywebcopy, so I tried it. The tool creates a folder and an HTML file in the destination with the correct URL file name, but the file itself is blank. I added the authentication piece, but it made no difference: the saved HTML file is still blank.


import os
import requests

# Request the main site with the username and password
requests.get('main website url', auth=('username','password'))

urls = ['url1', 'url2', 'url3']  # etc.

output_dir = 'folder on my drive'

# Fetch each URL and save the body under the URL's base name
for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        file_path = os.path.join(output_dir, os.path.basename(url))
        with open(file_path, 'wb') as f:
            f.write(response.content)

This is the HTTP debug output:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): apps.bell.com:443
send: b'GET / HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 302 \r\n'
header: Set-Cookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD; Path=/; Secure; HttpOnly
header: Location: https://apps.bell.com/apps/bell/bell.bellmain
header: Content-Type: text/html;charset=ISO-8859-1
header: Content-Length: 0
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET / HTTP/1.1" 302 0
send: b'GET /apps/bell/bell.bellmain HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nCookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD\r\n\r\n'
reply: 'HTTP/1.1 401 \r\n'
header: WWW-Authenticate: Basic realm="bellProduction System - V8MU.Q3"
header: Content-Type: text/html
header: Content-Length: 522
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET /apps/bell/bell.bellmain HTTP/1.1" 401 522

The auth keyword argument to requests.get expects an authentication object. For convenience, if it is passed a tuple, it acts as though it was asked to do HTTP Basic Authentication. This authentication mechanism is not stateful, so you have to pass the auth parameter to every get call.
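As a minimal sketch of that advice, the loop from the question can be wrapped in a function that sends the credentials with every request. The name `download_all` and the `get` parameter are hypothetical, introduced here only so the function can be tried with a stand-in instead of a live server. The sketch also assumes that each URL's base name is a usable file name, as the original code does.

```python
import os

import requests


def download_all(urls, output_dir, auth, get=requests.get):
    """Save the body of each URL into output_dir and return the saved paths.

    `get` defaults to requests.get; it is a parameter (an assumption of this
    sketch) so that a stand-in can be supplied when no server is available.
    """
    saved = []
    for url in urls:
        # Basic Auth is stateless, so the credentials go with every request.
        response = get(url, auth=auth)
        if response.status_code != 200:
            # Report failures instead of skipping them silently.
            print(f"skipped {url}: HTTP {response.status_code}")
            continue
        file_path = os.path.join(output_dir, os.path.basename(url))
        with open(file_path, "wb") as f:
            f.write(response.content)
        saved.append(file_path)
    return saved
```

Called as `download_all(urls, output_dir, ('username', 'password'))`, it should write one file per URL that returns 200 and print the status code of any URL that does not, which makes a 401 like the one in the debug output visible rather than silent.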

You might be saying: "But I don't have to do that in my browser." And that's correct. Most web browsers these days (Firefox and Chrome definitely do, as I can personally attest) remember HTTP Basic Auth credentials for websites you've visited and automatically send them when asked again by the same site, so you don't see the same prompt over and over. But that's something your web browser does, not something the server does. So when you're making HTTP requests by hand, you're responsible for doing the same.
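One way to get that browser-like behaviour in requests is a `Session` object, whose `auth` attribute is applied to every request made through it. The following is a sketch rather than a drop-in fix: `make_session` is a hypothetical helper name, and the credentials are placeholders.

```python
import requests


def make_session(username, password):
    """Return a Session that sends HTTP Basic Auth credentials on every request."""
    session = requests.Session()
    # Credentials set here are attached to each request the session makes,
    # much as a browser resends remembered credentials.
    session.auth = (username, password)
    return session
```

With `session = make_session('username', 'password')`, replacing `requests.get(url)` in the loop with `session.get(url)` should send the credentials each time. A session also keeps cookies between requests, such as the JSESSIONID cookie visible in the debug output.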

