Web scraping HTML in R

I want to get the list of URLs by scraping http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm, like this:

[1] "P-Obama-Inaugural-Speech-Inauguration.htm"
[2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"

and this is my code:

library(XML)

url = "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc = htmlTreeParse(url, useInternalNodes = T)
url.list = xpathSApply(doc, "//a[contains(@href, 'htm')]")

The problem is that I want to unlist() url.list so I can strsplit it, but it doesn't unlist.

One more step ought to do it (you just need to get the href attribute):

library(XML)

url <- "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)

url.list <- xpathSApply(doc, "//a[contains(@href, 'htm')]")
hrefs <- gsub("^/", "", sapply(url.list, xmlGetAttr, "href"))

head(hrefs, 6)

## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"                                                                                                 
## [2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"                                                    
## [3] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"                                                    
## [4] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"                                                                       
## [5] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
## [6] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"

free(doc)
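The reason unlist() seemed to do nothing is that xpathSApply() here returns a list of internal XML node pointers, not character strings; once the href attributes are extracted, as above, you have an ordinary character vector that strsplit() handles fine. A minimal base-R sketch, using a made-up pair of hrefs standing in for the scraped values:

```r
# Hypothetical sample standing in for the scraped `hrefs` vector above
hrefs <- c("/P-Obama-Inaugural-Speech-Inauguration.htm",
           "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm")

# Strip a leading slash, as in the answer
hrefs <- gsub("^/", "", hrefs)

# A plain character vector splits without any unlist() gymnastics
parts <- strsplit(hrefs, "-")
parts[[1]][1]
## [1] "P"
```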

UPDATE Obligatory rvest + dplyr way:

library(rvest)
library(dplyr)

# read_html() replaces the now-removed html() in current rvest
speeches <- read_html("http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm")
speeches %>% html_nodes("a[href*='htm']") %>% html_attr("href") %>% head(6)

## same output as above
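As a side note, if neither XML nor rvest is available, a rough base-R approximation with regmatches() can pull the hrefs out of a simple page like this one. Regex-based HTML parsing is fragile, so treat this as a fallback rather than a replacement for a real parser; the snippet below is a made-up stand-in for the downloaded page:

```r
# A small HTML snippet standing in for the downloaded page
html <- '<a href="P-Obama-Inaugural-Speech-Inauguration.htm">Inaugural</a>
<a href="E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm">Manassas</a>'

# Pull every href value containing "htm" (mirrors the XPath/CSS filters above)
m <- regmatches(html, gregexpr('href="[^"]*htm[^"]*"', html))[[1]]
hrefs <- gsub('^href="|"$', "", m)
hrefs
## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"
## [2] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"
```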


