
How to download embedded PDF files from webpage using RSelenium?

Edit: Based on the comments I have received so far, I managed to reach the PDF file I am looking for with RSelenium, using the following code:

library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()

Now I need R to click the download button, but I have not been able to do so. I tried:

button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()

But I get the following error:

Selenium message:Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'

Erro:    Summary: NoSuchElement
 Detail: An element could not be located on the page using the given search parameters.
 class: org.openqa.selenium.NoSuchElementException
 Further Details: run errorDetails method

Can someone tell me what is wrong here? Thanks!

Original question:

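I have several webpages from which I need to download embedded PDF files, and I am looking for a way to automate this with R. This is one of those pages:
https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
It is a page from the CVM (Comissão de Valores Mobiliários, the Brazilian equivalent of the US Securities and Exchange Commission, SEC) for downloading the Notes to Financial Statements (Notas Explicativas) of Brazilian companies.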

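I tried several options, but the website seems to be built in a way that makes it difficult to extract the direct links. I tried the suggestion from the question on downloading all PDFs from a URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector. Likewise, when I tried the approach from https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") also produces an empty vector.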

I am not used to writing web scraping code in R, so I do not know what is going on. I appreciate any help!

In case anyone runs into the same problem, here is the solution I used:

library(RSelenium)

# Set up a Firefox profile that saves PDFs automatically instead of opening them
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))

driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # Give the page time to load (3 seconds)

# Select the option in the cmbGrupo drop-down that opens the PDF file
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()

# The download button sits inside nested iframes, so switch into them first
web.elem <- remote_driver$findElements(using = "css selector", "iframe") # all iframes in the top-level page
sapply(web.elem, function(x){x$getElementAttribute("id")}) # inspect their ids
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css selector", "iframe") # all iframes inside Formularios Filho
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # inspect their ids
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # Give the PDF viewer time to load (3 seconds)

# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # start the download
Sys.sleep(3) # Allow some time for the download to finish before closing the window
remote_driver$close() # Close the window
driver[["server"]]$stop() # Stop the Selenium server started by rsDriver()
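
The fixed Sys.sleep(3) pauses above can be fragile on a slow connection. A minimal sketch of an alternative is to poll for a condition with a timeout. The helpers wait_until() and pdf_downloaded() below are hypothetical names, not part of RSelenium, and the check for "*.part" files assumes Firefox's usual behaviour of writing a temporary ".part" file while a download is in progress.

```r
# Poll `condition` (a zero-argument function) until it returns TRUE or
# `timeout` seconds have elapsed. Errors raised by `condition` are treated
# as "not ready yet". Returns TRUE on success, FALSE on timeout.
# (Hypothetical helper, not part of RSelenium.)
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# TRUE once `dir` contains at least one PDF and no partial download.
# Assumption: the browser writes "*.part" files while downloading.
pdf_downloaded <- function(dir) {
  files <- list.files(dir)
  any(grepl("\\.pdf$", files, ignore.case = TRUE)) &&
    !any(grepl("\\.part$", files))
}
```

With the session from the solution still open, the pause before locating the button could then become wait_until(function() length(remote_driver$findElements(using = "xpath", "//*[@id='download']")) > 0), and the pause before closing the window could become wait_until(function() pdf_downloaded(download_dir), timeout = 60), where download_dir is a placeholder for whichever folder the Firefox profile saves to.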

