R web-scraping - hidden text in HTML

Question

I want to scrape the urls from the following page:

http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5

There are 180 urls to be collected from this page (each is a link to a speech given in Parliament), but I am running into problems whenever there are more than 100 urls to be scraped, as the additional speeches are only accessible by clicking on the "See More" box at the bottom of the page. I've tried to figure out how to reveal the additional links that I think are hidden by the "getMore" function, but with no luck! Apologies for naiveté here...

My current code is as follows:

Read in the page

mep.speech.list.url <-"http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
speech.list.data<-try(readLines(mep.speech.list.url),silent=TRUE)

Find urls

mep.speech.list<-speech.list.data
mep.speech.lines<-grep("href",mep.speech.list)
mep.speech.list<-mep.speech.list[mep.speech.lines]
mep.speech.lines<-grep("target",mep.speech.list)
mep.speech.list<-mep.speech.list[mep.speech.lines]
mep.speech.list<-mep.speech.list[-length(mep.speech.list)]

Clean URLs

mep.speech.list.end<-regexpr("target",mep.speech.list)
mep.speech.list<-substr(mep.speech.list,1, mep.speech.list.end)
mep.speech.list<-gsub("\t","",mep.speech.list)
mep.speech.list<-gsub('<a href=\"',"",mep.speech.list)
mep.speech.list<-gsub('\" target',"",mep.speech.list)
mep.speech.list<-gsub('\" targe',"",mep.speech.list)    
mep.speech.list<-gsub('\" targ',"",mep.speech.list)
mep.speech.list<-gsub('\" tar',"",mep.speech.list)
mep.speech.list<-gsub('\" ta',"",mep.speech.list)
mep.speech.list<-gsub('\" t',"",mep.speech.list)    
mep.speech.list<-mep.speech.list[5:length(mep.speech.list)]
print(mep.speech.list)

Answer 1

The SEE MORE button executes some javascript that carries out an AJAX call. You can use Selenium to automate the browser and extract the links:

require(RSelenium)
appURL <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "seemore")$clickElement()
Sys.sleep(5)
jsScript <-"var hrefs = new Array();
$('#content_left .listcontent a').each(function(){
hrefs.push($(this).attr('href'));
});
return hrefs;"

appHREF <- remDr$executeScript(jsScript)[[1]]
> head(appHREF)
[1] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040504+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-205"
[2] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-069"
[3] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-122"
[4] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040421+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=3-207"
[5] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-074"
[6] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-099"
>

R web-scraping - hidden text in HTML

Question

Read in the page

Find urls

Clean URLs

1 answers

solution1
3 ACCPTED 2014-04-14 14:16:38

R web-scraping - hidden text in HTML

Question

Read in the page

Find urls

Clean URLs

1 answers

solution1 3 ACCPTED 2014-04-14 14:16:38

solution1
3 ACCPTED 2014-04-14 14:16:38