
R web-scraping - hidden text in HTML

I want to scrape the URLs from the following page:

http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5

There are 180 URLs to collect from this page (each links to a speech given in Parliament), but I run into problems whenever there are more than 100 URLs to scrape, because the additional speeches are only accessible by clicking the "See More" box at the bottom of the page. I've tried to work out how to reveal the additional links, which I think are hidden by the "getMore" function, but with no luck. Apologies for my naiveté here.

My current code is as follows:

Read in the page

mep.speech.list.url <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
speech.list.data <- try(readLines(mep.speech.list.url), silent = TRUE)

Find URLs

mep.speech.list <- speech.list.data
mep.speech.lines <- grep("href", mep.speech.list)
mep.speech.list <- mep.speech.list[mep.speech.lines]
mep.speech.lines <- grep("target", mep.speech.list)
mep.speech.list <- mep.speech.list[mep.speech.lines]
mep.speech.list <- mep.speech.list[-length(mep.speech.list)]

Clean URLs

mep.speech.list.end <- regexpr("target", mep.speech.list)
mep.speech.list <- substr(mep.speech.list, 1, mep.speech.list.end)
mep.speech.list <- gsub("\t", "", mep.speech.list)
mep.speech.list <- gsub('<a href=\"', "", mep.speech.list)
mep.speech.list <- gsub('\" target', "", mep.speech.list)
mep.speech.list <- gsub('\" targe', "", mep.speech.list)
mep.speech.list <- gsub('\" targ', "", mep.speech.list)
mep.speech.list <- gsub('\" tar', "", mep.speech.list)
mep.speech.list <- gsub('\" ta', "", mep.speech.list)
mep.speech.list <- gsub('\" t', "", mep.speech.list)
mep.speech.list <- mep.speech.list[5:length(mep.speech.list)]
print(mep.speech.list)
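
As an aside, the chain of gsub() calls above could probably be condensed into a single capture-group match. The following is only a minimal sketch in base R: the helper extract_hrefs() and the sample lines are hypothetical stand-ins for the output of readLines(), and it still only sees links present in the initial HTML, so it does not by itself reveal the links behind "See More".

```r
# Hypothetical sample lines standing in for the output of readLines()
html_lines <- c(
  '\t<a href="http://example.org/speech/1" target="_blank">Speech 1</a>',
  '<p>No link here</p>',
  '\t<a href="http://example.org/speech/2" target="_blank">Speech 2</a>'
)

# extract_hrefs() is an illustrative helper, not part of the original code
extract_hrefs <- function(lines) {
  # Capture the text between href=" and the closing quote,
  # but only for anchors that are followed by a target attribute
  pattern <- '<a href="([^"]+)" target'
  matches <- regmatches(lines, regexec(pattern, lines))
  # Matching lines yield c(full_match, capture); other lines yield character(0)
  hits <- matches[lengths(matches) == 2]
  vapply(hits, function(m) m[2], character(1), USE.NAMES = FALSE)
}

urls <- extract_hrefs(html_lines)
print(urls)
```

Because the pattern captures the URL directly, no positional trimming or repeated substitution of partial "target" strings is needed.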

The SEE MORE button runs some JavaScript that makes an AJAX call, so the extra links are not in the HTML that readLines() downloads. You can use Selenium to automate the browser and extract the links:

library(RSelenium)
appURL <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
# Note: startServer() is defunct in recent RSelenium releases;
# rsDriver() or a Docker-hosted Selenium server is the usual replacement there.
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appURL)
# Click "See More" so the AJAX call loads the remaining links
remDr$findElement("id", "seemore")$clickElement()
# Give the AJAX request time to finish
Sys.sleep(5)
# Collect the href of every link in the speech list using the page's jQuery
jsScript <-"var hrefs = new Array();
$('#content_left .listcontent a').each(function(){
hrefs.push($(this).attr('href'));
});
return hrefs;"

appHREF <- remDr$executeScript(jsScript)[[1]]
> head(appHREF)
[1] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040504+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-205"
[2] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-069"
[3] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-122"
[4] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040421+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=3-207"
[5] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-074"
[6] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-099"
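
Once the links are collected, it may help to de-duplicate them and pull the sitting date out of each URL. The sketch below is an assumption-laden illustration in base R: it assumes every speech URL follows the `TEXT+CRE+YYYYMMDD+` pattern visible in the output above, and the vector hrefs (two URLs copied from that output, one repeated, plus a hypothetical non-speech link) stands in for the real appHREF.

```r
# Stand-in for appHREF: two URLs from the output above (one duplicated)
# and a hypothetical link that is not a speech
hrefs <- c(
  "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040504+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-205",
  "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-069",
  "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040504+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-205",
  "http://example.org/not-a-speech"
)

# Keep only speech documents and drop exact duplicates
speech_urls <- unique(hrefs[grepl("getDoc.do", hrefs, fixed = TRUE)])

# Assumption: the eight digits after "TEXT+CRE+" are the sitting date (YYYYMMDD)
date_strings <- sub(".*TEXT\\+CRE\\+([0-9]{8})\\+.*", "\\1", speech_urls)

speeches <- data.frame(
  url = speech_urls,
  date = as.Date(date_strings, format = "%Y%m%d"),
  stringsAsFactors = FALSE
)

print(speeches$date)
```

Comparing nrow(speeches) against the 180 speeches expected for this page would be a quick way to check whether a single click on "See More" was enough.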
