
Scraping a website with JavaScript links

I am using R for web scraping. The information I need is behind the links on this webpage, but when I click one, the link leads back to the same page I was on. How can I follow these links until I reach the tables with the information I need? I started using R several months ago and I know httr, Curl and other packages, but I am not able to scrape this webpage. I need output like the following (obtained by clicking "Todo el territorio" and choosing Tipo de estudios: "Bachillerato"):

Provincia|Localidad|Denominacion Generica|Denominacion Especifica|Codigo|Naturaleza
Almería|Adra|Instituto de Educación Secundaria|Abdera|04000110|Centro público
Almería|Adra|Instituto de Educación Secundaria|Gaviota|04000134|Centro público

... ...

This would be my general script using the RSelenium package, but it does not work, and I am open to any alternative:

library(RSelenium)
library(XML)
library(magrittr)

RSelenium::checkForServer()
RSelenium::startServer()
remDrv <- RSelenium::remoteDriver(remoteServerAddr = "localhost", port = 4444, browserName = "chrome")
remDrv$open()

remDrv$navigate('https://www.educacion.gob.es/centros/selectaut.do')
remDrv$findElement(using = "xpath", "//select[@name = '.listado-inicio']/option[@value = ('02','00')]")$clickElement()

...

or something like this. I found similar scripts in other Stack Overflow topics, but I get nothing back. I am open to solutions using other scripts. Thanks a lot.

Using 'RSelenium' to navigate the site, you could do:

library(RSelenium)
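library(XML)  # provides htmlParse() and readHTMLTable(), used below
# Start a Selenium server. Note: checkForServer() and startServer() are
# defunct in newer RSelenium releases, where rsDriver() replaces them.
checkForServer()
startServer()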
remDr <- remoteDriver()
remDr$open()

remDr$navigate('https://www.educacion.gob.es/centros/selectaut.do')

#Click on the todo el territorio link
remDr$findElement(using = "xpath", "//a[text()='Todo el territorio']")$clickElement()

#select the Bachillerato option (has a value of 133) and click on the search button
remDr$findElement(using = "xpath", "//select[@id='comboniv']/option[@value='133']")$clickElement()
remDr$findElement(using = "xpath", "//input[@id='idGhost']")$clickElement()

#Click on the show results button
remDr$findElement(using = "xpath", "//input[@title='Buscar']")$clickElement()

#parse the html and get the table
doc <- htmlParse(remDr$getPageSource()[[1]],encoding="UTF-8")
data <- readHTMLTable(doc)$matcentro
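
To get from the scraped data frame to the pipe-delimited output shown in the question, something like the following base R sketch might work. It is only an illustration: `to_pipe_lines` and `sample_data` are hypothetical names, the two rows are the sample rows from the question, and the column names are assumed to match the header in the question rather than taken from the real table.

```r
# Hypothetical stand-in for the scraped `data` data frame. The rows are the
# two sample rows from the question; the column names are an assumption.
sample_data <- data.frame(
  "Provincia" = c("Almería", "Almería"),
  "Localidad" = c("Adra", "Adra"),
  "Denominacion Generica" = c("Instituto de Educación Secundaria",
                              "Instituto de Educación Secundaria"),
  "Denominacion Especifica" = c("Abdera", "Gaviota"),
  "Codigo" = c("04000110", "04000134"),  # kept as text to preserve leading zeros
  "Naturaleza" = c("Centro público", "Centro público"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Render a data frame as pipe-delimited lines, header first.
# Columns are pasted together row by row; factors are converted to their labels.
to_pipe_lines <- function(df) {
  header <- paste(names(df), collapse = "|")
  rows <- do.call(paste, c(unname(as.list(df)), sep = "|"))
  c(header, rows)
}

writeLines(to_pipe_lines(sample_data))
```

With the real result, `writeLines(to_pipe_lines(data))` would print the table, and `writeLines(to_pipe_lines(data), "centros.txt")` would save it to a file (the file name is just an example).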


 