
How can I download a PDF file from a URL where the PDF is embedded in the HTML?

What I'm trying to do: I want to scrape a web page to get the amount of a financial transaction from a PDF file that the website loads with JavaScript. Example website: http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=

When I click the 'View Document' button, the PDF file loads into my browser window (I'm using Google Chrome). I can right-click on the PDF and save it to my computer, but I want to automate that process by having Selenium (or a similar package) download the file so that I can then run OCR on it.

If I can get it saved, I will be able to do the OCR part (I hope). I just can't get the file saved.

From here, I found and modified this code:

def download_pdf(lnk):

    from selenium import webdriver
    from time import sleep

    options = webdriver.ChromeOptions()

    download_folder = "C:\\Users\\rickc\\Documents\\Scraper2\\screenshots\\"

    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}

    options.add_experimental_option("prefs", profile)

    print("Downloading file from link: {}".format(lnk))

    driver = webdriver.Chrome(chrome_options = options)
    driver.get(lnk)

    filename = lnk.split("/")[3].split(".aspx")[0]+".pdf"
    print("File: {}".format(filename))

    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))

    driver.close()

download_pdf('http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=')

But it isn't working. My old college professor once said, "If you've spent more than two hours on the problem and haven't made headway, it's time to look for help elsewhere." So I'm looking for help.

Other info: The link above will take you to a web page, but you can't access the PDF document until you click on the 'View Document' button. I've tried using Selenium's webdriver.find_element_by_ID('btnDocument').click() to make things happen, but it just loads the page and doesn't do anything with it.

You can download the PDF using the requests and BeautifulSoup libraries. In the code below, replace /Users/../aaa.pdf with the full path where the document should be saved:

import requests
from bs4 import BeautifulSoup

url = 'http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg='

# Load the page first so the ASP.NET hidden form fields can be read from it.
response = requests.get(url)
page = BeautifulSoup(response.text, "html.parser")

VIEWSTATE = page.select_one("#__VIEWSTATE").attrs["value"]
VIEWSTATEGENERATOR = page.select_one("#__VIEWSTATEGENERATOR").attrs["value"]
EVENTVALIDATION = page.select_one("#__EVENTVALIDATION").attrs["value"]
btnDocument = page.select_one("[name=btnDocument]").attrs["value"]

# Post the form back as if the 'View Document' button had been clicked.
data = {
    '__VIEWSTATE': VIEWSTATE,
    '__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR,
    '__EVENTVALIDATION': EVENTVALIDATION,
    'btnDocument': btnDocument,
}
response = requests.post(url, data=data)

# Save the response body, which should be the PDF, in binary mode.
with open('/Users/../aaa.pdf', 'wb') as f:
    f.write(response.content)
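
If the site later adds or renames hidden fields, or the post returns an HTML error page instead of the document, the code above would save a broken file without any warning. Here is a rough sketch of two small helpers that might guard against that, using only the standard library so it can be tried without a network connection. The names hidden_fields and looks_like_pdf and the sample form are my own illustrations, not taken from the site, and checking the first 1024 bytes for the %PDF- marker is only a heuristic.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def looks_like_pdf(content):
    """Heuristic: a PDF carries the marker %PDF- near the start of the file."""
    return b"%PDF-" in content[:1024]


# Hypothetical form, shaped like an ASP.NET page, for illustration only.
sample = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="submit" name="btnDocument" value="View Document" />
</form>
"""

data = hidden_fields(sample)
data["btnDocument"] = "View Document"  # the submit button still has to be added
print(sorted(data))
print(looks_like_pdf(b"%PDF-1.4 ..."), looks_like_pdf(b"<!DOCTYPE html>"))
```

In the real script you would pass response.text to hidden_fields, and call looks_like_pdf(response.content) before writing the file, raising an error if it returns False.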
