Extracting links from PDFs in R using regular expressions

Question

I am trying to clean a list of PDFs of links. I want to include this step in my cleaning function, so I am using regexes. And yes, I have spent more time than I like to admit googling and browsing through questions here. My PDFs are split into lines, so the text is not one consecutive string. I have a piece of code that returns only one link (even though there should be many). All other options I tried also matched a lot of text that I want to keep in my dataset.

I have tried multiple options outside my function, but they only work on examples, not on the actual texts.

I want to catch everything from the www up to the first whitespace, including everything that comes after the .org or .html or whatever (e.g. /questions/ask/somethingelse).

I tried simulating some things:

w <- "www.smthing.org/knowledge/school/principal.\r"
z <- "www.oecd.de\r"
x <- "www.bla.pdfwerr\r .irgendwas" # should not catch that, too many characters after the . 
m <-  "           www.cognitioninstitute.org/index.php/Publications/ 
 bla test smth 
  .gtw, www.stmthing-else.html.\r"
n <- "decoy"


l <- list(w,z,x,m,n)

regmatches(l, regexpr("w{3}\\.[a-z]*\\.[a-z]{2,4}.*?[[:space:]]", l))

My current working version catches only the first occurrence in a given line. Also, instead of stopping at the whitespace (line m in my example), it runs on and includes the next link as well.

Answer 1

You may use:

regmatches(l, gregexpr("w{3}\\.\\S*\\b", l))

The gregexpr function will let you extract all occurrences of the pattern.

Note that most users prefer spelling out www instead of using w{3}.
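
To make the difference between regexpr and gregexpr concrete, here is a minimal sketch on a made-up line that holds two links. The sample line and the variable names are assumptions for illustration; only the pattern is taken from the answer.

```r
# Hypothetical sample line with two links (not taken from a real PDF)
line <- "see www.cognitioninstitute.org/index.php/Publications/ and www.oecd.de\r"

pattern <- "w{3}\\.\\S*\\b"

# regexpr() reports only the first match in each input string
first_only <- regmatches(line, regexpr(pattern, line))

# gregexpr() reports every match; regmatches() then returns a list
# with one element per input string, so [[1]] picks the matches for `line`
all_links <- regmatches(line, gregexpr(pattern, line))[[1]]

print(first_only)
print(all_links)
```

When several lines are processed at once, unlist() can flatten the list returned by regmatches into a single character vector if that is more convenient.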

Pattern details

w{3} - three w chars
\\. - a literal dot (the regex \. with the backslash doubled for the R string)
\\S* - zero or more non-whitespace chars
\\b - a word boundary
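
Since the question is about cleaning the text rather than collecting the links, the same pattern could also be handed to gsub. The sketch below assumes a hypothetical helper named remove_links and made-up input lines; it is one possible way to wire the pattern into a cleaning function, not the asker's actual function.

```r
# Hypothetical helper: delete www-style links from a character vector of lines.
# The default pattern is the one from the answer above.
remove_links <- function(lines, pattern = "w{3}\\.\\S*\\b") {
  gsub(pattern, "", lines)
}

# Made-up input lines for illustration
lines <- c(
  "www.oecd.de\r",
  "decoy",
  "text www.smthing.org/knowledge/school/principal.\r more"
)

cleaned <- remove_links(lines)
print(cleaned)
```

Because the pattern ends at a word boundary, punctuation that directly follows a link (such as a sentence-final dot) is left in the text and may need a separate cleanup step.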

Answer 1: 2, accepted, 2019-07-22 14:28:07