Using the tidyr package in R, can we filter and extract links from a tibble?

Question

Suppose I have this tibble:

Transcript                                                                                                                                                                                                                                                                                                                          
1 Hi i would like to find out more about <a href="https://mywebsite.com/internalfaq/faq/154200">http://mywebsite.com/internalfaq/faq/154200</a> please help
2 Hello my results were withheld at <a href="https://mywebsite.com/123">https://mywebsite.com/123</a> hope you can help
3 Hello my friend join me at https://mywebsite.com/456

I tried:

links = data %>%
    extract(Transcript, url.pattern)

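But this does not give me what I want. Even though I supplied the URL pattern, it does not return a list of links; it returns only the first word. Am I doing something wrong? Thanks in advance!

Here is my URL pattern: https://mywebsite.com/.*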

Answer 1

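The into argument of extract must be specified. In the question's call, url.pattern is matched positionally to into, so extract falls back on its default regex, "([[:alnum:]]+)", which captures only the first word. Also, wrap the pattern in parentheses so that it forms a capture group.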

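library(tidyr)
# One capture group; the link ends at a double quote, angle bracket, or space,
# so the closing quote of href="..." is not captured
url.pattern <- "(https://mywebsite\\.com/[^\"<> ]*)"
data %>%
  extract(Transcript, into = "link", regex = url.pattern)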
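To make this answer reproducible, here is a minimal sketch that rebuilds the sample data from the question and runs extract on it. The tibble() construction and the remove = FALSE argument are additions for illustration, and the character class also excludes the double quote, which is an assumption made here so that the closing quote of href="..." does not end up in the link. Note that extract returns only the first match in each row.

```r
library(tibble)
library(tidyr)

# Sample rows copied from the question
data <- tibble(Transcript = c(
  'Hi i would like to find out more about <a href="https://mywebsite.com/internalfaq/faq/154200">http://mywebsite.com/internalfaq/faq/154200</a> please help',
  'Hello my results were withheld at <a href="https://mywebsite.com/123">https://mywebsite.com/123</a> hope you can help',
  'Hello my friend join me at https://mywebsite.com/456'
))

# Assumed pattern: a single capture group that stops at a quote,
# an angle bracket, or a space
url.pattern <- "(https://mywebsite\\.com/[^\"<> ]*)"

# remove = FALSE keeps the original Transcript column next to the new one
links <- extract(data, Transcript, into = "link",
                 regex = url.pattern, remove = FALSE)

print(links$link)
```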

Answer 2

You can use regmatches:

regmatches(h, gregexpr("http.*?(\\d+)", h))
[[1]]
[1] "https://mywebsite.com/internalfaq/faq/154200" "http://mywebsite.com/internalfaq/faq/154200" 

[[2]]
[1] "https://mywebsite.com/123" "https://mywebsite.com/123"

[[3]]
[1] "https://mywebsite.com/456"

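This gives you the full URLs. What is h? h is Transcript[,1]. It is a vector, not a data frame.

Since the URLs appear to be repeated within each element, you can use regexpr instead of gregexpr to get only the first match from each element: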

regmatches(h,regexpr("http.*?(\\d+)",h))
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"                   
[3] "https://mywebsite.com/456"

You can also use the sub function with a backreference:

sub("(.*:)(.*\\d+)(.*)","https:\\2",h)
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"                   
[3] "https://mywebsite.com/456"
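The three base R calls in this answer can be tried side by side with the sketch below. It assumes h is the Transcript column held as a plain character vector, rebuilt here from the question's sample rows; the names all_links, first_links, and sub_links are illustrative only.

```r
# h is assumed to be the Transcript column as a plain character vector
h <- c(
  'Hi i would like to find out more about <a href="https://mywebsite.com/internalfaq/faq/154200">http://mywebsite.com/internalfaq/faq/154200</a> please help',
  'Hello my results were withheld at <a href="https://mywebsite.com/123">https://mywebsite.com/123</a> hope you can help',
  'Hello my friend join me at https://mywebsite.com/456'
)

# gregexpr finds every match, so the result is a list with one
# character vector per element of h
all_links <- regmatches(h, gregexpr("http.*?(\\d+)", h))

# regexpr finds only the first match in each element, so the result
# is a plain character vector
first_links <- regmatches(h, regexpr("http.*?(\\d+)", h))

# sub keeps the second capture group and prefixes it with "https:"
sub_links <- sub("(.*:)(.*\\d+)(.*)", "https:\\2", h)

print(lengths(all_links))  # number of matches found in each element
print(first_links)
print(sub_links)
```

Both patterns rely on every link ending in digits, which holds for the sample rows but may not hold for other URLs.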

Answer 1
2 2018-01-26 04:15:59

Answer 2
0 2018-01-26 06:51:16
