R programming - How do I remove a string that appears multiple times in a text using gregexpr?

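Example: in the text below, I want to remove every sentence that starts with the word "Henry", contains the word "new" somewhere in the middle, and ends with the word "pen.".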

text = 'Henry just bought a new black pen. Henry\'s pen costs him $2. Henry buys a new blue pen.'

What I did:

result = gsub(pattern='((Henry).*(new).*(pen))+',replacement='',text)

What I want to achieve:

"Henry's pen costs him $2."

What I got:

""

I'm not quite sure what is wrong with my code. Can someone point me in the right direction?

As @thelatemail suggested, you can first split text after each . using

strsplit(text, "(?<=\\.)\\s+", perl = TRUE)

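where the pattern "(?<=\\.)\\s+" means that we split at one or more whitespace characters ( \\s+ ) that come right after a . (the lookbehind assertion (?<=\\.) ). Once we have done that, we can check each sentence against your criteria and drop the sentences that match them. Then we just need to paste the remaining sentences back together: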

library(magrittr)
filteredText <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)[[1]] %>%
        grep(pattern = "^Henry.*new.*pen\\.$", x = ., value = TRUE, invert = TRUE) %>%
        paste(collapse = " ")
# 
filteredText
# [1] "Henry's pen costs him $2."
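
As for why the original gsub call removes so much: `.*` is greedy and is free to match across periods, so a single match can run from the first "Henry" to the last "pen". A minimal sketch of a one-pattern alternative follows, assuming the only period in each sentence is the one that ends it (an assumption that breaks for text such as "$2.50"). The names `pattern`, `greedy`, `fixed` and `fixed2` are illustrative, and the last step shows the same idea with gregexpr, as asked in the title.

```r
text <- "Henry just bought a new black pen. Henry's pen costs him $2. Henry buys a new blue pen."

# Original pattern: the greedy `.*` can cross sentence boundaries, so one
# match covers nearly the whole text.
greedy <- gsub("((Henry).*(new).*(pen))+", "", text)

# Sketch of a fix: `[^.]*` cannot cross a period, so each match stays inside
# a single sentence. Assumes a sentence contains no period except its last character.
pattern <- "Henry[^.]*new[^.]*pen\\.\\s*"
fixed <- trimws(gsub(pattern, "", text, perl = TRUE))
fixed
# [1] "Henry's pen costs him $2."

# Same idea with gregexpr(): locate the matches, keep the pieces between them.
m <- gregexpr(pattern, text, perl = TRUE)
fixed2 <- trimws(paste(regmatches(text, m, invert = TRUE)[[1]], collapse = ""))
```

Note that this pattern does not anchor "Henry" to the start of a sentence, so the split-and-filter approach above is the safer choice for real text.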

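You need to tokenize by sentence. You can approximate that by using strsplit with split = '\\.', but that will fail as the text scales, e.g. by not splitting on ? or by splitting U.S.A. At that point, though, using a better sentence tokenizer is not hard at all, thanks to tidytext, which conveniently wraps the tokenizers package in a tidy framework.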
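
To make those failure modes concrete, here is a small base R sketch. The sample string `x` is made up for illustration and does not come from the question; splitting on every literal period leaves the question mark unsplit and cuts the abbreviation into pieces.

```r
# Hypothetical sample text (not from the question)
x <- "Is it new? Yes. It shipped from the U.S.A. yesterday."

# Naive split on every literal period, then trim surrounding whitespace
pieces <- trimws(strsplit(x, "\\.")[[1]])
pieces
# [1] "Is it new? Yes"        "It shipped from the U" "S"
# [4] "A"                     "yesterday"
```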

You can tokenize into sentences and then use regex:

library(tidyverse)
library(tidytext)

text = 'Henry just bought a new black pen. Henry\'s pen costs him $2. Henry buys a new blue pen.'

tibble(text) %>% 
    unnest_tokens(sentence, text, 'sentences', to_lower = FALSE) %>% 
    # drop a sentence only when all three conditions hold
    filter(!(grepl('^Henry ', sentence) & 
             grepl('.new.{2,}', sentence) &
             grepl('pen\\.$', sentence)))
#> # A tibble: 1 x 1
#>                    sentence
#>                       <chr>
#> 1 Henry's pen costs him $2.

...or re-tokenize into words to use more basic comparisons:

tibble(text) %>% 
    unnest_tokens(sentence, text, 'sentences', to_lower = FALSE) %>% 
    unnest_tokens(word, sentence, drop = FALSE) %>% 
    group_by(sentence) %>% 
    # drop a sentence only when all three conditions hold
    filter(!(first(word) == 'henry' &
             'new' %in% word &
             last(word) == 'pen'))
#> # A tibble: 5 x 2
#> # Groups:   sentence [1]
#>                    sentence    word
#>                       <chr>   <chr>
#> 1 Henry's pen costs him $2. henry's
#> 2 Henry's pen costs him $2.     pen
#> 3 Henry's pen costs him $2.   costs
#> 4 Henry's pen costs him $2.     him
#> 5 Henry's pen costs him $2.       2
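
The word-level comparison can also be sketched in base R, without tidytext. This is only a rough approximation: `matches_criteria` is a hypothetical helper, the sentence split reuses the lookbehind pattern from the first answer, and the word split simply treats anything other than lowercase letters, digits and apostrophes as a separator.

```r
text <- "Henry just bought a new black pen. Henry's pen costs him $2. Henry buys a new blue pen."
sentences <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)[[1]]

# Hypothetical helper: TRUE when the first word is "henry", some word is
# "new", and the last word is "pen" (compared in lowercase).
matches_criteria <- function(sentence) {
  words <- strsplit(tolower(sentence), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]  # drop an empty leading token, if any
  length(words) > 0 &&
    words[1] == "henry" &&
    "new" %in% words &&
    words[length(words)] == "pen"
}

# Keep only the sentences that do not meet all three conditions
keep <- !vapply(sentences, matches_criteria, logical(1))
result <- paste(sentences[keep], collapse = " ")
result
# [1] "Henry's pen costs him $2."
```

Because "henry's" is kept as a single word, the second sentence fails the first-word test and survives, which mirrors the tidytext output above.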
