Extracting a sentence from a line of text

Question

I have multiple sentences structured like this:

text <- "09/11/2017\n                        Janssen noted September 11, 2017 that no further development planned."

My goal is to extract everything except the leading whitespace and the "mm/dd/yyyy\n" prefix. So far, I am doing this:

text <-  substring(text, 20, last=100)

> text
[1] "                Janssen noted September 11, 2017 that no further development plan"

The output is close, except that I want to drop the whitespace before the text while keeping the spaces between words.

In my real data:

> nchar <- nchar(df$text, type = "chars", allowNA = TRUE, keepNA = NA) # Count characters
> max(nchar,na.rm=TRUE)
[1] 81

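My longest text is 81 characters, so I picked a start position that deliberately skips the date and set last to a value larger than my longest string.

This is not a perfect approach. Could I set last to the string's length (nchar)?

Either way, I am looking for a better solution.

Desired output: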

"Janssen noted September 11, 2017 that no further development planned."

Answer 1

What about

gsub("\\d+/\\d+/\\d+\\n\\s+(.+)$", "\\1", text)

Answer 2

Building on what you started, you can use the trimws function to remove the leading whitespace.

text <-  substring(text, 20, 1000000L) # what you did first
trimws(text, which = "left") # remove the leading whitespace
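On the question of using nchar for last: a minimal sketch, assuming the sample string from the question and assuming the date plus newline always occupies the first 11 characters. The names rest and result are only for illustration.

```r
text <- "09/11/2017\n                        Janssen noted September 11, 2017 that no further development planned."

# Use the string's own length as the stop position instead of a hard-coded number.
# Position 12 is the first character after "mm/dd/yyyy\n" (assumed fixed width).
rest <- substring(text, 12, nchar(text))

# Remove the whitespace left in front of the sentence; spaces between words are kept.
result <- trimws(rest, which = "left")
result
# [1] "Janssen noted September 11, 2017 that no further development planned."
```

Because substring and nchar are both vectorized, the same two lines should work on a whole column such as df$text.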

Answer 3

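Here is another approach that works:

gsub("^[\\W\\d+]+(.*)","\\1",text, perl=TRUE)

^ anchors the match at the start of the string.

\\W matches a non-word character and \\d matches a digit. (Inside square brackets, the + after \\d is a literal plus sign, not a quantifier.)

Putting these inside square brackets means "match any one of them".

The + after the brackets matches the preceding token one or more times.

(.*) matches everything after the leading whitespace and digits and captures it in group 1.

We return that group with \\1.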
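A quick check of this pattern, written with R's usual double-backslash escaping. This is only a sketch: the first element is the sample string from the question, while the second element and the names x and cleaned are made up for illustration.

```r
x <- c(
  "09/11/2017\n                        Janssen noted September 11, 2017 that no further development planned.",
  "01/02/2018\n    Hypothetical second note."  # made-up example, not from the question
)

# Strip the leading run of non-word characters and digits (date, slashes,
# newline, spaces), then keep the remainder captured in group 1.
cleaned <- gsub("^[\\W\\d+]+(.*)", "\\1", x, perl = TRUE)
cleaned
```

One thing to keep in mind with this character class: a sentence that itself begins with a digit would lose that digit too, since digits are part of what gets stripped from the front.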