R - How to extract items from XML Nodeset?
I have a list of 438 pitcher names that looks like this (in an XML nodeset):
> pitcherlinks[[1]]
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
<a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
</td>
> pitcherlinks[[2]]
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
<a href="/players/a/adlemti01.shtml">Tim Adleman</a>
</td>
How can I extract the name (e.g. Fernando Abad) and the associated link (e.g. /players/a/abadfe01.shtml) from each entry?
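Since you have a list, use one of the apply functions to walk through it. For each element, the function parses the HTML fragment with read_html and then finds the anchor (link) with html_nodes and the CSS selector a. The name comes from html_text, and the link is in the href attribute, which html_attr reads.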
library(rvest)
# Recreate the two example entries as HTML strings
pitcherlinks <- list()
pitcherlinks[[1]] <-
'<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
<a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
</td>'
pitcherlinks[[2]] <-
'<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
<a href="/players/a/adlemti01.shtml">Tim Adleman</a>
</td>'
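# Parse each fragment, select its <a> element, and take the text (the player name)
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
# Same selection, but take the href attribute (the relative link)
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})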
names
# [1] "Fernando Abad" "Tim Adleman"
links
# [1] "/players/a/abadfe01.shtml" "/players/a/adlemti01.shtml"
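If pitcherlinks is already an xml_nodeset, as in the question, rather than a list of strings, re-parsing each element with read_html may be unnecessary, because html_nodes, html_text and html_attr also accept a nodeset directly. The following is a minimal sketch under that assumption. The inline table stands in for the real page, and the names page, anchors, player_names and player_links are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for the scraped page: the two example cells inside a table
page <- read_html('<table><tr>
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
<a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
</td>
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
<a href="/players/a/adlemti01.shtml">Tim Adleman</a>
</td>
</tr></table>')

# An xml_nodeset of <td> cells, similar in shape to the one in the question
pitcherlinks <- html_nodes(page, "td")

# Select the <a> elements across the whole nodeset in one call
anchors <- html_nodes(pitcherlinks, "a")

# Text inside each anchor (the trailing "*" sits outside <a>, so it is not included)
player_names <- html_text(anchors)

# Value of the href attribute of each anchor
player_links <- html_attr(anchors, "href")
```

One caveat: html_nodes skips any cell that has no anchor, so the two vectors stay aligned with each other but not necessarily with pitcherlinks. If that matters, html_node (singular) should return one result per cell, with NA where no link is found.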