Using tidyr package in R, are we able to filter and extract links from a tibble?
Suppose I have this small tibble:
Transcript
1 Hi i would like to find out more about <a href="https://mywebsite.com/internalfaq/faq/154200">http://mywebsite.com/internalfaq/faq/154200</a> please help
2 Hello my results were withheld at <a href="https://mywebsite.com/123">https://mywebsite.com/123</a> hope you can help
3 Hello my friend join me at https://mywebsite.com/456
I tried:
links = data %>%
extract(Transcript, url.pattern)
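But this does not give me what I want. Even though I supplied the URL pattern, it does not return a list of links; it only returns the first word. Am I doing something wrong? Thanks in advance!
This is my URL pattern: https://mywebsite.com/.*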
The into argument of extract must be specified. Also, try adding parentheses (a capture group) to the regular expression.
library(tidyr)
# Capture group: the URL up to the next quote, angle bracket, or space
url.pattern <- "(https://mywebsite\\.com/[^\"<> ]*)"
data %>%
  extract(Transcript, into = 'link', regex = url.pattern)
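A likely reason the original call returned only the first word: in tidyr's extract, the second argument after the column is into, and regex defaults to "([[:alnum:]]+)". Passing the pattern positionally therefore names the new column rather than setting the regex. The following is a minimal sketch of the named-argument call, assuming a data frame called transcripts (a hypothetical stand-in for the question's tibble) and a hypothetical pattern that stops at quotes, angle brackets, and spaces:

```r
library(tidyr)

# Hypothetical reconstruction of the data from the question
transcripts <- data.frame(
  Transcript = c(
    'Hi i would like to find out more about <a href="https://mywebsite.com/internalfaq/faq/154200">http://mywebsite.com/internalfaq/faq/154200</a> please help',
    'Hello my results were withheld at <a href="https://mywebsite.com/123">https://mywebsite.com/123</a> hope you can help',
    'Hello my friend join me at https://mywebsite.com/456'
  ),
  stringsAsFactors = FALSE
)

# One capture group, so `into` needs exactly one column name
url_pattern <- "(https://mywebsite\\.com/[^\"<> ]*)"

# Name both arguments so the pattern is not mistaken for `into`
links <- extract(transcripts, Transcript, into = "link", regex = url_pattern)
print(links$link)
```

Note that extract keeps only the first match per row, so a row containing several different links would need another approach.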
You can use regmatches:
regmatches(h,gregexpr("http.*?(\\d+)",h))
[[1]]
[1] "https://mywebsite.com/internalfaq/faq/154200" "http://mywebsite.com/internalfaq/faq/154200"
[[2]]
[1] "https://mywebsite.com/123" "https://mywebsite.com/123"
[[3]]
[1] "https://mywebsite.com/456"
This gives you the whole URLs. What is h? It is Transcript[,1], which is a vector and not a data frame.
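The pattern above relies on every URL ending in digits. As a sketch of a variant without that assumption, the snippet below rebuilds h by hand (a hypothetical reconstruction of the question's column) and uses a hypothetical pattern, url_regex, that stops at the first quote, angle bracket, or space:

```r
# Hypothetical reconstruction of the character vector h
h <- c(
  'Hi i would like to find out more about <a href="https://mywebsite.com/internalfaq/faq/154200">http://mywebsite.com/internalfaq/faq/154200</a> please help',
  'Hello my results were withheld at <a href="https://mywebsite.com/123">https://mywebsite.com/123</a> hope you can help',
  'Hello my friend join me at https://mywebsite.com/456'
)

# Match http or https, then everything up to a quote, angle bracket, or space
url_regex <- "https?://[^\"<> ]+"

# gregexpr finds every match; regmatches returns one character vector per element
all_links <- regmatches(h, gregexpr(url_regex, h))

# Number of links found in each transcript
n_links <- lengths(all_links)
print(all_links)
```

The result is a list, so it can hold a different number of links for each transcript.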
Since the URLs appear to be repeated, you can use regexpr instead of gregexpr to get only the first one in each element:
regmatches(h,regexpr("http.*?(\\d+)",h))
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"
[3] "https://mywebsite.com/456"
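One caveat worth keeping in mind: regmatches combined with regexpr returns only the matched substrings, so elements without a link are dropped and the result can be shorter than the input. A minimal sketch that keeps the output aligned with the input, assuming a made-up vector h that includes one element without a URL:

```r
# Hypothetical input; the second element contains no link
h <- c(
  "see https://mywebsite.com/123 please",
  "no link here",
  "join me at https://mywebsite.com/456"
)

# regexpr returns -1 for elements where the pattern does not match
m <- regexpr("https?://[^\"<> ]+", h)

# Pre-fill with NA, then write the matches back at their original positions
first_link <- rep(NA_character_, length(h))
first_link[m != -1] <- regmatches(h, m)
print(first_link)
```

This keeps one entry per transcript, which makes it safe to assign the result back as a column.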
You can also use the sub function with backreferences:
sub("(.*:)(.*\\d+)(.*)","https:\\2",h)
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"
[3] "https://mywebsite.com/456"
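The call above rebuilds the URL by prepending "https:" to the second group. As an alternative sketch, a single capture group can return the first URL exactly as written; the pattern and the sample vector h below are illustrative assumptions. Because sub leaves non-matching strings unchanged, the sketch marks those as NA explicitly:

```r
# Hypothetical input; the second element contains no link
h <- c(
  'Hello my results were withheld at <a href="https://mywebsite.com/123">https://mywebsite.com/123</a> hope you can help',
  "no link here"
)

# Lazy prefix, one capture group for the first URL, then the rest of the string
pattern <- ".*?(https?://[^\"<> ]+).*"

# Replace the whole string with the captured group
first_url <- sub(pattern, "\\1", h, perl = TRUE)

# sub returns the input untouched when nothing matches, so flag those as NA
first_url[!grepl(pattern, h, perl = TRUE)] <- NA_character_
print(first_url)
```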