R programming - How do I remove a string that appears multiple times in a text using gregexpr?
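Example: in the example below, what I want to achieve is to remove every sentence that starts with the word "Henry", contains the word "new" somewhere in the middle, and ends with the word "pen.".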
text = 'Henry just bought a new black pen. Henry\'s pen costs him $2. Henry buys a new blue pen.'
What I did:
result = gsub(pattern='((Henry).*(new).*(pen))+',replacement='',text)
What I want to achieve:
"Henry's pen costs him $2."
What I got:
""
I'm not quite sure what is wrong with my code; can someone point me in the right direction?
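As @thelatemail suggested, you can first split text at each "." using
strsplit(text, "(?<=\\.)\\s+", perl = TRUE)
where the pattern "(?<=\\.)\\s+" means that we split at the whitespace (\\s+, one or more whitespace characters) that follows a "." (the lookbehind assertion (?<=\\.)). Once we have done that, we can check each sentence against your criteria and drop the ones that match. Then we only need to paste the remaining sentences back together: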
library(magrittr)
filteredText <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)[[1]] %>%
grep(pattern = "^Henry.*new.*pen\\.$", x = ., value = TRUE, invert = TRUE) %>%
paste(collapse = " ")
#
filteredText
# [1] "Henry's pen costs him $2."
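The same split, filter, paste steps can also be written in base R without the pipe. This is only a sketch: the function name remove_matching_sentences and its arguments are made up for illustration and do not come from the answer above.

```r
# Hypothetical helper (name and arguments are illustrative assumptions).
remove_matching_sentences <- function(text, pattern) {
  # Split after each "." that is followed by whitespace
  sentences <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)[[1]]
  # Keep only the sentences that do NOT match the pattern
  kept <- sentences[!grepl(pattern, sentences)]
  # Glue the surviving sentences back together
  paste(kept, collapse = " ")
}

text <- 'Henry just bought a new black pen. Henry\'s pen costs him $2. Henry buys a new blue pen.'
filtered <- remove_matching_sentences(text, "^Henry.*new.*pen\\.$")
print(filtered)
```

Keeping the pattern as an argument makes it easy to try other criteria on the same text.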
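You need to tokenize by sentence. You can approximate that with strsplit and split = '\\.', but this breaks down as the text grows, e.g. by not splitting on ? or by splitting inside an abbreviation such as U.S.A. At this point, though, using a better sentence tokenizer is not hard at all thanks to tidytext, which conveniently wraps the tokenizers package in a tidy framework.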
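To make the limitation concrete, here is a small base R sketch of what splitting on periods does to a sentence pair containing a question mark and an abbreviation. The sample string text2 is invented for this illustration.

```r
# text2 is a made-up example, not part of the original question
text2 <- "Is this Henry's pen? He lives in the U.S.A. now."

# Naive approach: split on every literal period
pieces <- strsplit(text2, "\\.")[[1]]
print(pieces)

# The "?" is not treated as a sentence boundary, so the first piece
# still contains two sentences; each period inside the abbreviation
# is treated as a boundary, so "U.S.A." is cut into fragments.
print(length(pieces))
```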
You can tokenize into sentences and then use regular expressions:
library(tidyverse)
library(tidytext)
text = 'Henry just bought a new black pen. Henry\'s pen costs him $2. Henry buys a new blue pen.'
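tibble(text) %>%   # data_frame() is deprecated in favour of tibble()
  unnest_tokens(sentence, text, 'sentences', to_lower = FALSE) %>%
  # drop a sentence only when it meets all three criteria
  filter(!(grepl('^Henry ', sentence) &
           grepl('.new.{2,}', sentence) &
           grepl('pen\\.$', sentence)))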
#> # A tibble: 1 x 1
#> sentence
#> <chr>
#> 1 Henry's pen costs him $2.
...or re-tokenize into words to use more basic comparisons:
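tibble(text) %>%   # data_frame() is deprecated in favour of tibble()
  unnest_tokens(sentence, text, 'sentences', to_lower = FALSE) %>%
  unnest_tokens(word, sentence, drop = FALSE) %>%   # words are lowercased by default
  group_by(sentence) %>%
  # drop a sentence only when it meets all three criteria
  filter(!(first(word) == 'henry' &
           'new' %in% word &
           last(word) == 'pen'))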
#> # A tibble: 5 x 2
#> # Groups: sentence [1]
#> sentence word
#> <chr> <chr>
#> 1 Henry's pen costs him $2. henry's
#> 2 Henry's pen costs him $2. pen
#> 3 Henry's pen costs him $2. costs
#> 4 Henry's pen costs him $2. him
#> 5 Henry's pen costs him $2. 2
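Coming back to the pattern in the question: .* is greedy and is allowed to run across sentence boundaries, so a single match can stretch from the first "Henry" to the last "pen" and swallow the sentence in between. The sketch below first shows that span, then tries a single-gsub variant that uses [^.]* so a match cannot cross a period. The variant is an illustrative assumption, not something taken from the answers above, and it shares the limits of period-based splitting: it does not anchor on sentence starts and will stumble over abbreviations.

```r
text <- 'Henry just bought a new black pen. Henry\'s pen costs him $2. Henry buys a new blue pen.'

# What the greedy pattern matches: one span running from the first
# "Henry" to the last "pen", across all three sentences
greedy_match <- regmatches(text, regexpr("Henry.*new.*pen", text))
print(greedy_match)

# [^.]* cannot cross a period, so each match stays inside one sentence;
# \\s* also removes the whitespace that followed a deleted sentence,
# and trimws() cleans up whatever is left at the ends
result <- trimws(gsub("Henry[^.]*new[^.]*pen\\.\\s*", "", text))
print(result)
```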