Replace the whole word that starts with a pattern using gsub in R

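I have a problem that should be easy to solve. I want to replace every whole word that starts with a given pattern with a fixed string.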

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."

## this is what i want
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

The best I have come up with so far is

# this is what I get, but it's not correct
> gsub("\\<wasn*.\\>", "wasn't", test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't't aware. just wasn't't."

I have really run out of ideas. I would also be happy with

# second desired output without the . at the end
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

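Edit: It seems my question was a bit too specific, so I am adding further test cases. Basically, I do not know which characters will follow "wasn", and I want to convert all such words to wasn't.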

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
> test
[1] "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"

# desired output
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"

You can use the negative lookahead that Perl-style regular expressions provide, with pattern = wasn(?!')t*

gsub("wasn(?!')t*", "wasn't", test, perl = TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
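
To see why the lookahead matters, here is a small sketch. The input string `x` is made up for illustration and is not from the question. Without `(?!')`, the pattern `wasnt*` also matches the `wasn` inside an already correct `wasn't`, which is how the doubled `'t` arises; the lookahead skips those occurrences. Lookaheads need `perl = TRUE`.

```r
# Illustrative input, not from the original question
x <- "i wasn aware, i wasnt aware, i wasn't aware"

# No lookahead: the "wasn" inside "wasn't" is matched too
naive <- gsub("wasnt*", "wasn't", x)
print(naive)
# "i wasn't aware, i wasn't aware, i wasn't't aware"

# Negative lookahead: skip any "wasn" directly followed by an apostrophe
fixed <- gsub("wasn(?!')t*", "wasn't", x, perl = TRUE)
print(fixed)
# "i wasn't aware, i wasn't aware, i wasn't aware"
```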

Or you can do the following:

gsub("wasn'*t*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

For the second desired output:

gsub("wasn'*t*[.]?","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

After the edit:

gsub("wasn[^. ]*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
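
As a rough sketch of how this last pattern behaves, it can be wrapped in a small function. The name `fix_wasnt` and the second input are assumptions for illustration, not part of the answer. The character class `[^. ]` stops the match only at a period or a space, so any other punctuation glued to the word is consumed by the replacement.

```r
# fix_wasnt() is a hypothetical wrapper name, not part of the original answer
fix_wasnt <- function(x) gsub("wasn[^. ]*", "wasn't", x)

res <- fix_wasnt(c("this wasn45'e meant to be.", "it wasn@'re simple", "just wasn't."))
print(res)
# "this wasn't meant to be."
# "it wasn't simple"
# "just wasn't."

# Caveat: only "." and " " stop the match, so a trailing comma is replaced as well
caveat <- fix_wasnt("she wasnt, i know")
print(caveat)
# "she wasn't i know"
```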

I suggest a solution like this:

test <- c("i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple", "Wasn&^$tt that nice?", "You say wasnmmmt?", "No, he wasn&#t#@$.", "She wasn%#@t##, I know.")
 gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
[2] "Wasn't that nice?"                                                                                                          
[3] "You say wasn't?"                                                                                                            
[4] "No, he wasn't."                                                                                                             
[5] "She wasn't, I know." 


This solution handles cases where wasn* appears at the start of the string or is capitalized, and it does not replace trailing punctuation.

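Pattern details

  • \\b - a word boundary
  • (wasn) - capturing group 1 (referred to later as \\1 in the replacement pattern): the substring wasn (case-insensitive because of ignore.case=TRUE)
  • \\S*\\b - zero or more non-whitespace characters, followed by a word boundary
  • (?:\\S*(\\p{P})\\B)? - an optional non-capturing group, matching 1 or 0 occurrences of:
    • \\S* - zero or more non-whitespace characters
    • (\\p{P}) - capturing group 2 (referred to later as \\2 in the replacement pattern): any single punctuation character (not a symbol! \\p{P} is not the same as [:punct:]!) that is not followed by...
    • \\B - ...a letter, digit, or _ (this is the non-word-boundary pattern).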
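
The difference between \\p{P} and the POSIX punctuation class can be checked directly. The sketch below assumes R's PCRE build includes Unicode property support (which the solution above also relies on); the sample characters are chosen for illustration. Characters such as `$`, `^`, and `+` are classified as symbols in Unicode, so \\p{P} does not match them, while `[[:punct:]]` does.

```r
# Illustrative sample: three punctuation characters, then three symbols
chars <- c(",", ".", "#", "$", "^", "+")

# Unicode "Punctuation" property (requires perl = TRUE)
is_p <- grepl("^\\p{P}$", chars, perl = TRUE)

# POSIX punctuation class, which also covers ASCII symbols
is_punct <- grepl("^[[:punct:]]$", chars, perl = TRUE)

print(data.frame(chars, is_p, is_punct))
```

This is why group 2 in the pattern can stop at `,` or `.` but never captures something like `$`.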

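For messier strings (for example, She wasn%#@t##,$#^ I know.), where punctuation can appear inside other punctuation, you can use a custom bracket expression to restrict which punctuation marks the match stops at, and add \\S* at the end: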

gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
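
As a quick check of this variant, here is a sketch that applies it to the messy example sentence on its own (the variable names `messy`, `pat`, and `cleaned` are introduced here for illustration). The bracket expression captures the first of `?!.,:;` found after the word, and the trailing \\S* discards whatever junk follows it up to the next whitespace.

```r
# The messy example sentence, stored on its own for illustration
messy <- "She wasn%#@t##,$#^ I know."

pat <- "\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?"

# Group 1 keeps the original capitalization of "wasn"; group 2 keeps the comma
cleaned <- gsub(pat, "\\1't\\2", messy, ignore.case = TRUE, perl = TRUE)
print(cleaned)
# "She wasn't, I know."
```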


Why not keep it simple and replace every word that starts with wasn with wasn't?

test2 <- paste0(
  "i really wasn aware and i wasnt aware at all. but i wasn't aware. just ",
  "wasn't. this wasn45'e meant to be. it wasn@'re simple"
)
gsub("wasn[^ ]*", "wasn't", test2)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't this wasn't meant to be. it wasn't simple"

If you also want to handle uppercase letters, you can add ignore.case = TRUE to gsub().
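
One thing to watch with ignore.case = TRUE: the replacement string is inserted literally, so a capitalized Wasn comes back in lowercase. A possible workaround, sketched below with a made-up input, is to capture the matched prefix and reuse it via \\1. This capture-group variant is an assumption on my part, borrowed from the earlier answer rather than stated in this one.

```r
# Illustrative input, not from the original question
x <- "Wasnt that nice? it wasnt."

# Literal replacement: the capital W is lost (and the final "." is consumed)
lowered <- gsub("wasn[^ ]*", "wasn't", x, ignore.case = TRUE)
print(lowered)
# "wasn't that nice? it wasn't"

# Capture the matched prefix and reuse it, so the original case is preserved
kept <- gsub("(wasn)[^ ]*", "\\1't", x, ignore.case = TRUE)
print(kept)
# "Wasn't that nice? it wasn't"
```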
