
How to scrape a password-protected ASPX (PDF) page

I'm trying to scrape data about my band's upcoming shows from our agent's web service (such as venue capacity, venue address, set length, set start time ...).

With Python 3.6 and Selenium I've successfully logged in to the site, scraped a bunch of data from the main page, and opened the deal sheet, which is a PDF-like ASPX page. From there I'm unable to scrape the deal sheet. I've successfully switched the Selenium driver to the deal sheet. But when I inspect that page, none of the content is there, just a list of JavaScript scripts.

I tried...

innerHTML = driver.execute_script("return document.body.innerHTML") 

...but this yields the same list of scripts rather than the PDF content I can see in the browser.

I've tried the solution suggested here: Python scraping pdf from URL

But the HTML that solution returns is for the login page, not the deal sheet. My problem is different because the PDF is protected by a password.

You won't be able to read the PDF file using the Selenium Python API bindings; the solution would be:

  1. Download the file from the web page using the requests library. Given that you need to be logged in, my expectation is that you might need to fetch cookies from the browser session via the driver.get_cookies() command and add them to the request that downloads the PDF file.
  2. Once you have downloaded the file, you will be able to read its content using, for instance, PyPDF2.
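The cookie-transfer idea in step 1 could be sketched like this (a minimal sketch, not a tested implementation: the function name is illustrative, and `driver` is assumed to be the already-logged-in Selenium driver from the question):

```python
import requests

def download_pdf_with_browser_cookies(driver, pdf_url, out_path):
    """Copy the logged-in Selenium session's cookies into a requests
    session, then download the protected PDF to out_path."""
    session = requests.Session()
    # Each cookie from Selenium is a dict with at least 'name' and 'value'.
    for cookie in driver.get_cookies():
        session.cookies.set(cookie['name'], cookie['value'])
    response = session.get(pdf_url)
    response.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(response.content)
    return out_path
```

You would then call something like `download_pdf_with_browser_cookies(driver, url, 'deal_sheet.pdf')` once you have the PDF's URL.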

This 3-part solution works for me:

Part 1 (Get the URL for the password-protected PDF)

# with selenium
from time import sleep

driver.find_element_by_xpath('xpath To The PDF Link').click()

# wait for the new window to load
sleep(6)

# switch to the new window that just popped up
driver.switch_to.window(driver.window_handles[1])

# get the URL to the PDF
plugin = driver.find_element_by_css_selector("#plugin")
url = plugin.get_attribute("src")

The element holding the URL might be different on your page. Michael Kennedy also suggested #embed and #content.

Part 2 (Create a persistent session with python requests, as described here: How to "log in" to a website using Python's Requests module?, and download the PDF.)

import requests

# Fill in your details here to be posted to the login form.
# Your parameter names are probably different. You can find them by inspecting the login page.
payload = {
    'logOnCode': username,
    'passWord': password
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as session:
    session.post(logonURL, data=payload)

    # An authorized request.
    response = session.get(url)  # this is the protected url
    with open('c:/yourFilename.pdf', 'wb') as f:
        f.write(response.content)

Part 3 (Scrape the PDF with PyPDF2, as suggested by Dmitri T)
