
Problem downloading a PDF file using Python

From https://research.un.org/en/docs/ga/quick/regular/76 I intend to download the first resolution (A/RES/76/307). Its link ( https://undocs.org/en/A/RES/76/307 ) redirects to https://documents-dds-ny.un.org/doc/UNDOC/GEN/N22/587/47/PDF/N2258747.pdf?OpenElement when clicked.

I use the standard approach for downloading PDFs:

import requests

url = "https://undocs.org/en/A/RES/76/307"
response = requests.get(url)

print(response.status_code)
print(response.content)

with open("document.pdf", "wb") as f:
    f.write(response.content)

While the status code indicates everything is okay (200), the content is simply:

b'\n<head>\n</head>\n<body text="#000000">\n<META HTTP-EQUIV="refresh" CONTENT="1; URL=/tmp/1286884.54627991.html">\n</body>\n</html>\n'

This is evidently not the actual content of the document. A PDF file is saved, but it is far too small and I cannot open it with a document viewer ("File type HTML document (text/html) is not supported").
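A quick sanity check before writing anything to disk is to look at the first bytes of the payload: every real PDF begins with the magic bytes %PDF. This small helper is a sketch of my own, not part of the original code:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    return data.startswith(b"%PDF")

# With the response above, this would return False:
# the body is an HTML meta-refresh stub, not a PDF.
# print(looks_like_pdf(response.content))
```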

How can I download that PDF file using Python?

You won't be able to download the PDF file via requests alone, because you have no actual download link. The website you are referring to serves a redirect page and opens the PDF inside the browser itself.

Using Selenium together with BeautifulSoup can resolve this. With BeautifulSoup we extract the temporary URL of the PDF file from the response:

soup = BeautifulSoup(response.text, 'html.parser')
meta = soup.find('meta')  # the <META HTTP-EQUIV="refresh"> tag
url = "https://daccess-ods.un.org" + meta['content'].split('URL=')[1]
# e.g. https://daccess-ods.un.org/tmp/6937936.54441834.html

With Selenium we then open a browser configured to download PDFs instead of displaying them. The complete code could look like this:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time

url = "https://undocs.org/en/A/RES/76/307"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The page is a meta-refresh stub; its content attribute holds the temporary URL.
meta = soup.find('meta')
url = "https://daccess-ods.un.org" + meta['content'].split('URL=')[1]


def download_pdf(lnk):
    options = webdriver.ChromeOptions()

    download_folder = "C:\\test\\"

    # Disable the built-in PDF viewer so Chrome downloads PDFs instead of opening them.
    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": "",
               "plugins.always_open_pdf_externally": True}

    options.add_experimental_option("prefs", profile)

    print("Downloading file from link: {}".format(lnk))
    driver = webdriver.Chrome(options=options)  # Selenium 4 uses options=, not chrome_options=
    driver.get(lnk)
    filename = lnk.split("/")[-1]
    print("File: {}".format(filename))
    time.sleep(5)  # crude fixed wait for the download to finish
    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))
    driver.quit()

print(url)
download_pdf(url)

(Shoutout: the Selenium part is partly from "Python Download PDF Embedded in a Page".)

I'm not very experienced, so my answer might not be the best documented.

You could try using Beautiful Soup. It is easy to learn for what you're trying to do here: it lets you search for elements in the page's HTML and extract them in a straightforward way.
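As a minimal sketch of that idea (the HTML below is a made-up stand-in for a real page), Beautiful Soup can collect every link ending in .pdf, which you could then fetch with requests:

```python
from bs4 import BeautifulSoup

# Hypothetical page content, just for illustration.
html = """
<html><body>
  <a href="/docs/report.pdf">Report</a>
  <a href="/about.html">About</a>
  <a href="/docs/summary.pdf">Summary</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only anchors whose href ends in .pdf.
pdf_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].lower().endswith(".pdf")]
print(pdf_links)  # ['/docs/report.pdf', '/docs/summary.pdf']
```

Each collected link could then be downloaded with requests.get() and written to disk in binary mode, as in the question's own code.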

Here are some links with useful information:

https://www.geeksforgeeks.org/downloading-pdfs-with-python-using-requests-and-beautifulsoup/

https://beautiful-soup-4.readthedocs.io/en/latest/

I would help you more precisely, but it has been a long time since I used it.

Anyway, it's easy to find the information I'm talking about, and you will be able to adapt it to your code.

I assume there's no problem with using Beautiful Soup. If there is, I can't help you further. :)

Have fun coding!
