I want to scrape the urls from the following page:
http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5
There are 180 urls to be collected from this page (each is a link to a speech given in Parliament), but I am running into problems whenever there are more than 100 urls to be scraped, as the additional speeches are only accessible by clicking on the "See More" box at the bottom of the page. I've tried to figure out how to reveal the additional links that I think are hidden by the "getMore" function, but with no luck! Apologies for naiveté here...
My current code is as follows:
mep.speech.list.url <-"http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
speech.list.data<-try(readLines(mep.speech.list.url),silent=TRUE)
mep.speech.list<-speech.list.data
mep.speech.lines<-grep("href",mep.speech.list)
mep.speech.list<-mep.speech.list[mep.speech.lines]
mep.speech.lines<-grep("target",mep.speech.list)
mep.speech.list<-mep.speech.list[mep.speech.lines]
mep.speech.list<-mep.speech.list[-length(mep.speech.list)]
mep.speech.list.end<-regexpr("target",mep.speech.list)
mep.speech.list<-substr(mep.speech.list,1, mep.speech.list.end)
mep.speech.list<-gsub("\t","",mep.speech.list)
mep.speech.list<-gsub('<a href=\"',"",mep.speech.list)
mep.speech.list<-gsub('\" target',"",mep.speech.list)
mep.speech.list<-gsub('\" targe',"",mep.speech.list)
mep.speech.list<-gsub('\" targ',"",mep.speech.list)
mep.speech.list<-gsub('\" tar',"",mep.speech.list)
mep.speech.list<-gsub('\" ta',"",mep.speech.list)
mep.speech.list<-gsub('\" t',"",mep.speech.list)
mep.speech.list<-mep.speech.list[5:length(mep.speech.list)]
print(mep.speech.list)
The SEE MORE button executes some javascript that carries out an AJAX call. You can use Selenium to automate the browser and extract the links:
require(RSelenium)
appURL <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "seemore")$clickElement()
Sys.sleep(5)
jsScript <-"var hrefs = new Array();
$('#content_left .listcontent a').each(function(){
hrefs.push($(this).attr('href'));
});
return hrefs;"
appHREF <- remDr$executeScript(jsScript)[[1]]
> head(appHREF)
[1] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040504+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-205"
[2] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-069"
[3] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-122"
[4] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040421+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=3-207"
[5] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-074"
[6] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-099"
>
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.