
Python Requests returns 403 error when downloading a PDF file

I have been trying to download a PDF file using requests but, no matter what I do, the request keeps returning status 403 and the downloaded file cannot be opened as a PDF.

Here is the code I am running:

import requests

url_pdf = 'https://www.agerborsamerci.it/wp-content/uploads/2022/01/Settimanale-n.-2-del-20-Gennaio-2022-%E2%80%93-Listino-Borsa-n.-2.pdf'

# session = requests.Session()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
    "Accept": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
    "Cache-Control": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
    "host-header": "6b7412fb82ca5edfd0917e3957f05d89",
    "Accept-Encoding": "gzip, deflate, br",
    "cache-control": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
    "Connection": "keep-alive",
    "referer": "https://www.agerborsamerci.it/wp-content/uploads/2022/01/Settimanale-n.-2-del-20-Gennaio-2022-%E2%80%93-Listino-Borsa-n.-2.pdf",
}

req = requests.get(url_pdf, headers=headers)
print(req.status_code)

# The response body is written regardless of the status code.
with open("bologna.pdf", 'wb') as f:
    f.write(req.content)

As you can see, I have tried using a 'Session' object, setting (different) 'User-Agent' as well as other headers but nothing seems to work.

I have also tried using

import os

name = 'bologna.pdf'
os.system('wget {} -O {}'.format(url_pdf, name))

But it is not working either.

Do you have any idea what I could do to overcome this problem? I am really struggling to figure it out.

Thanks a lot!

Avoid sending headers unless they are required. Try the anonymous defaults first (the server still sees your IP details); the download takes only about 2 seconds:

curl -o bologna.pdf  https://www.agerborsamerci.it/wp-content/uploads/2022/01/Settimanale-n.-2-del-20-Gennaio-2022-%E2%80%93-Listino-Borsa-n.-2.pdf

This works on my Windows 7 machine with curl added, and should work natively on Windows 10 or 11:

>curl -o bologna.pdf https://www.agerborsamerci.it/wp-content/uploads/2022/01/Settimanale-n.-2-del-20-Gennaio-2022-%E2%80%93-Listino-Borsa-n.-2.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  346k  100  346k    0     0   117k      0  0:00:02  0:00:02 --:--:--  117k
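
The same "defaults first" idea can be sketched in Python. This is only a minimal sketch, not a confirmed fix: the curl run above shows that a request without custom headers succeeded, but whether the default headers sent by requests are also accepted by this server is an assumption. The function name download_pdf and the PDF_MAGIC check are illustrative and not part of the original post. The sketch also refuses to write the file when the server answers with an error status or with something that does not start like a PDF, which avoids ending up with an unreadable bologna.pdf.

```python
PDF_MAGIC = b"%PDF-"  # PDF files start with these bytes


def download_pdf(url, dest, session=None, timeout=30):
    """Fetch `url` without custom headers and save the body to `dest`.

    `session` is any object with a requests-style .get() method. When it is
    omitted, a plain requests.Session() with its default headers is used.
    Returns the number of bytes written.
    """
    if session is None:
        import requests  # third-party package: pip install requests
        session = requests.Session()

    resp = session.get(url, timeout=timeout)
    # Raise on 4xx/5xx (including 403) instead of saving the error page.
    resp.raise_for_status()

    body = resp.content
    if not body.startswith(PDF_MAGIC):
        raise ValueError("response body does not look like a PDF")

    with open(dest, "wb") as f:
        f.write(body)
    return len(body)


# Example call (performs a real network request):
# download_pdf(url_pdf, "bologna.pdf")
```

If the call raises an HTTP error for status 403, the server is still rejecting the client, and the next step would be to compare what curl sends with what the Python client sends.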


A 403 error means that you do not have permission to access the page.

Per the link above,

The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.

I would recommend finding out which permission is required to access that site/page.
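
One way to start that investigation is to look at what the server actually sends back with the 403. The helper below is a hypothetical sketch (the name summarize_refusal is not from the original post). It only reads attributes that a requests.Response object provides (status_code, reason, headers, text), and it assumes that the Server header and the beginning of the body may hint at what is refusing the request, which will not always be the case.

```python
def summarize_refusal(resp, body_chars=200):
    """Collect the parts of a response that may hint at why it was refused.

    `resp` is expected to look like a requests.Response: it needs the
    attributes status_code, reason, headers (with .get()) and text.
    """
    headers = resp.headers
    return {
        "status": resp.status_code,
        "reason": resp.reason,
        "server": headers.get("Server"),
        "content_type": headers.get("Content-Type"),
        # Error pages often name the component that blocked the request.
        "body_start": resp.text[:body_chars],
    }


# Example call (performs a real network request):
# import requests
# print(summarize_refusal(requests.get(url_pdf)))
```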


 