How do I extract URLs containing dashes from HTML using R?

Question

I have some HTML that looks like this:

<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>
<li><a href="http://website.com/index.html" target="_blank">Website</a></li>
<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>
<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li>
<li><a href="http://www.another-site.com/">Another Site</a></li>

I am using

m<-regexpr("http://\\S*/?", links, perl=T)
links<-regmatches(links, m)

to get the links, but the ones that contain dashes are truncated, like this:

http://www.website.com/index.aspx
http://website.com/index.html
http://www.website
http://website2.org/index.htm
http://www.another-site.com/

I thought \S matched any non-whitespace character. What is going on here?

Answer 1

Use XML::getHTMLLinks.

For example:

library(XML)
# assuming your HTML document is 'foo.html'

getHTMLLinks(doc = 'foo.html')
# [1] "http://www.website.com/index.aspx"  "http://website.com/index.html"      "http://www.website-with-dashes.org"
# [4] "http://website2.org/index.htm"      "http://www.another-site.com/"
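
If the HTML is already held in a string rather than in a file, roughly the same result can be had by parsing the string directly. The following is only a sketch: the variable names `html`, `doc` and `hrefs` are made up for illustration, and it uses `htmlParse()` with `asText = TRUE` plus an XPath query instead of `getHTMLLinks()`.

```r
library(XML)

# Hypothetical in-memory input: two of the links from the question.
html <- paste0(
  '<ul>',
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li>',
  '</ul>'
)

# Parse the string as HTML instead of reading it from a file.
doc <- htmlParse(html, asText = TRUE)

# Collect the href attribute of every <a> element in the document.
hrefs <- unname(xpathSApply(doc, "//a", xmlGetAttr, "href"))
print(hrefs)
```

Because the parser reads the attribute value as a whole, dashes in the host name need no special handling.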

Parsing HTML with regular expressions is not necessarily straightforward. https://stackoverflow.com/a/1732454/1385941 is an entertaining read on the subject.
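
If a regular expression has to be used anyway, anchoring the pattern on the `href` attribute and its closing quote tends to be less fragile than matching on whitespace. This is a rough base R sketch, not a general HTML parser: it assumes one double-quoted `href` per input string, and the sample `links` vector and the name `urls` are invented for the example.

```r
# Hypothetical input: one string per line of HTML, as in the question.
links <- c(
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li>',
  '<li>no link here</li>'
)

# Match an href attribute and capture everything up to the closing double quote.
pattern <- 'href="([^"]*)"'
m <- regexec(pattern, links)
parts <- regmatches(links, m)

# Each element holds the full match followed by the captured group;
# keep the captured group and skip strings that did not match.
urls <- unlist(lapply(parts, function(p) if (length(p) >= 2) p[2] else NULL))
print(urls)
```

Only the first `href` in each string is picked up, and single-quoted or unquoted attributes are missed, which is part of why the parser-based approach is usually the safer choice.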

Answer 1: 4, accepted, 2013-08-22 06:08:29
