Selenium Firefox browser hangs after downloading a PDF

Question

I hope someone can help me understand what is going on:

I am downloading PDFs with Selenium and the Firefox browser (a login to the corresponding website is required).

    le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
    time.sleep(5)
    if le:
        pdf_link = le[0].get_attribute("href")
        browser.get(pdf_link)

The code does download the PDF, but afterwards it just sits idle. This seems to be related to the following browser settings:

    fp.set_preference("pdfjs.disabled", True)
    fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

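If I disable the first one, it does not hang, but it opens the PDF instead of downloading it. If I disable the second one, a "Save As" popup window appears. Can someone explain how to handle this?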

Answer 1

For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then fetch the file again with the Python requests library, with the Selenium cookies attached. More explanation below:

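There are a few ways Firefox + Selenium can handle a PDF. If you are using a recent version of Firefox, it will most likely render the PDF via pdf.js so that you can view it inline. That is not ideal, because now we cannot download the file.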

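You can disable pdf.js through the Selenium options, but this may lead to the browser hanging. It may be caused by an unknown MIME type, but I am not entirely sure. (Another StackOverflow answer says it also depends on the Firefox version.)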
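For reference, the preferences from the question can be collected in one place and applied to an options object. This is only a minimal sketch: the helper names `firefox_download_prefs` and `apply_prefs` are hypothetical, and the two `browser.download.*` preferences are an addition that was not in the question. It is not guaranteed to avoid the hang described above.

```python
def firefox_download_prefs(download_dir):
    """Return Firefox preferences that ask for PDFs to be saved without a prompt."""
    return {
        # Turn off the built-in pdf.js viewer so PDFs are not rendered inline.
        "pdfjs.disabled": True,
        # Do not show the "Save As" dialog for this MIME type.
        "browser.helperApps.neverAsk.saveToDisk": "application/pdf",
        # Assumption (not in the question): 2 means "use browser.download.dir".
        "browser.download.folderList": 2,
        "browser.download.dir": download_dir,
    }


def apply_prefs(options, prefs):
    """Call options.set_preference(name, value) for every entry in prefs."""
    for name, value in prefs.items():
        options.set_preference(name, value)
    return options
```

With Selenium installed, something like `apply_prefs(webdriver.FirefoxOptions(), firefox_download_prefs("/tmp/pdfs"))` should produce an options object that can be passed to `webdriver.Firefox(options=...)`.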

However, we can work around this by passing Selenium's cookie session to requests.session().

Here is a toy example:

import requests
from selenium import webdriver

pdf_url = "/url/to/some/file.pdf"

# setup driver with options (add whatever preferences you need)
options = webdriver.FirefoxOptions()
driver = webdriver.Firefox(options=options)

# do whatever you need to do to auth/login/click/etc.

# navigate to the PDF URL in case the PDF link issues a 
# redirect because requests.session() does not persist cookies
driver.get(pdf_url)

# get the URL from Selenium 
current_pdf_url = driver.current_url

# create a requests session
session = requests.session()

# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
    session.cookies.set(cookie["name"], cookie["value"])

# Note: If headers are also important, you'll need to use 
# something like seleniumwire to get the headers from Selenium 

# Finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)

# access the bytes response from the session
pdf_bytes = pdf_response.content
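A possible follow-up to the toy example, sketched under a few assumptions: the helper names `copy_cookies` and `save_pdf` are hypothetical, the cookie dicts are assumed to look like the ones returned by `driver.get_cookies()`, and the `%PDF-` signature check is a deliberately strict heuristic. The idea is to keep each cookie's domain and path when copying, and to refuse to save a response that is really a login or error page.

```python
from pathlib import Path


def copy_cookies(selenium_cookies, jar):
    """Copy Selenium cookie dicts into a requests-style cookie jar."""
    for cookie in selenium_cookies:
        # Only forward domain/path when Selenium actually reported them.
        extra = {key: cookie[key] for key in ("domain", "path") if cookie.get(key)}
        jar.set(cookie["name"], cookie["value"], **extra)


def save_pdf(pdf_bytes, destination):
    """Write pdf_bytes to destination, rejecting content without a PDF signature."""
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("response does not look like a PDF (login or error page?)")
    path = Path(destination)
    path.write_bytes(pdf_bytes)
    return path
```

In the example above this could be used as `copy_cookies(driver.get_cookies(), session.cookies)` before the request, followed by `save_pdf(pdf_response.content, "file.pdf")`.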

I strongly recommend using seleniumwire over plain Selenium, because it extends Python Selenium to let you read headers, wait for requests to complete, use proxies, and more.

Answer 1: accepted, 2021-04-30 19:14:19
