How to download embedded PDF files from a webpage using RSelenium?
EDIT: Based on the comments I have received so far, I managed to use RSelenium to reach the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now I need R to click the download button, but I could not get that to work. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message:Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Erro: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell me what is wrong here? Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files, and I am looking for a way to automate this with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398 It is a page from the CVM (Comissão de Valores Mobiliários, the Brazilian equivalent of the US Securities and Exchange Commission, SEC) used to download the Notes to Financial Statements (Notas Explicativas) of Brazilian companies.
I tried several options, but the website seems to be built in a way that makes it difficult to extract the direct links. I tried what is suggested in "Downloading all PDFs from URL", but
html_nodes(".ms-vb2 a") %>% html_attr("href")
yields an empty character vector. Similarly, when I tried the approach described at https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/ , the call
html_attr("href")
returns an empty vector.
I am not used to writing web scraping code in R, so I cannot figure out what is happening. I appreciate any help!
In case someone faces the same problem I did, here is the solution I used:
library(RSelenium)
# Set up a Firefox profile that downloads PDFs automatically
pdfprof <- makeFirefoxProfile(list(
"pdfjs.disabled" = TRUE,
"plugin.scan.plid.all" = FALSE,
"plugin.scan.Acrobat" = "99.0",
"browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open PDF file
option$clickElement()
# Find the iframes in the top-level page
web.elem <- remote_driver$findElements(using = "css", "iframe") # all iframes in the top-level document
sapply(web.elem, function(x){x$getElementAttribute("id")}) # inspect their ids
remote_driver$switchToFrame(web.elem[[1]]) # Move into the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # all iframes nested inside the current frame
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # inspect their ids
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move into the nested iframe (pdf Viewer)
Sys.sleep(3) # Give the PDF viewer time to load (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Allow some time for the download to finish before closing the window
remote_driver$close() # Close the window
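A possible refinement, not part of the solution above: the fixed Sys.sleep(3) pauses could be replaced by a polling wait, so the script proceeds as soon as the element exists and fails with a clear message if it never appears. This is only a sketch. wait_for_element is a hypothetical helper (it is not an RSelenium function), and the timeout and interval defaults are assumptions. It relies on findElement() raising an R error when nothing matches, as in the NoSuchElement output shown earlier.

```r
# Hypothetical helper (not part of RSelenium): poll until an element can be
# located, or stop with an error once `timeout` seconds have passed.
wait_for_element <- function(driver, using, value, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElement() signals an R error when nothing matches; catch it and retry
    elem <- tryCatch(
      driver$findElement(using = using, value = value),
      error = function(e) NULL
    )
    if (!is.null(elem)) {
      return(elem)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for element: ", value)
    }
    Sys.sleep(interval)
  }
}

# Possible usage with the session above, after switching into the pdf Viewer iframe:
# button <- wait_for_element(remote_driver, "xpath", "//*[@id='download']")
# button$clickElement()
```

The wait for the download itself still needs a different check (for example, watching the download folder), since there is no page element to poll for.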