[英]web scraping html in R
I want get the URL list from scraping http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm
like this: 我想从抓取
http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm
获取网址列表,如下所示:
[1] "P-Obama-Inaugural-Speech-Inauguration.htm"
[2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
and this is my code: 这是我的代码:
library(XML)
url = "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc = htmlTreeParse(url, useInternalNodes = T)
url.list = xpathSApply(doc, "//a[contains(@href, 'htm')]")
The problem is that I want to unlist()
url.list
so I can strsplit
it but it doesn't unlist
. 问题是我想
unlist()
url.list
以便我可以strsplit
它,但unlist
。
One more step ought to do it (just need to get the href
attribute): 应该再做一步(只需要获取
href
属性):
library(XML)
url <- "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)
url.list <- xpathSApply(doc, "//a[contains(@href, 'htm')]")
hrefs <- gsub("^/", "", sapply(url.list, xmlGetAttr, "href"))
head(hrefs, 6)
## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"
## [2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [3] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [4] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"
## [5] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
## [6] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
free(doc)
UPDATE Obligatory rvest + dplyr
way: 更新强制RVest +
dplyr
方式:
library(rvest)
library(dplyr)
speeches <- html("http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm")
speeches %>% html_nodes("a[href*=htm]") %>% html_attr("href") %>% head(6)
## same output as above
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.