Replace the whole word that starts with a pattern using gsub in R

I'm having trouble with a problem that should be simple to resolve. I'd like to replace every whole word in a string that starts with a given pattern.

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."

## this is what I want
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

The best I've come up with so far is this:

# this is what I get, but it's not correct
> gsub("\\<wasn*.\\>", "wasn't", test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't't aware. just wasn't't."
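
To see where the doubled 't comes from, it can help to list what the pattern actually matches. This is only a minimal sketch using base R's regmatches() and gregexpr(); the names attempt and hits are made up for illustration.

```r
test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."

# The pattern from the attempt above.
attempt <- "\\<wasn*.\\>"

# Extract every substring that the pattern matches.
hits <- regmatches(test, gregexpr(attempt, test))[[1]]
print(hits)
```

For the words already spelled wasn't, the match stops before the apostrophe, so the replacement is inserted in front of the existing 't.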

I'm really running out of ideas. I would also be happy with this:

# second desired output without the . at the end
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

Edit: it seems my question was a bit too specific, so I'm adding other test cases. Basically, I don't know which character(s) will follow "wasn", and I would like to convert every such word to wasn't.

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
> test
[1] "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"

# desired output
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"

You can use a negative lookahead, which requires perl = TRUE. The pattern is wasn(?!')t*:

gsub("wasn(?!')t*","wasn't",test,perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
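
As a small illustration of what the lookahead does, here is a sketch on three isolated spellings (the vector words is just an example, not part of the question). The (?!') part makes the match fail when wasn is immediately followed by an apostrophe, so the already correct spelling is left alone.

```r
# Three spellings taken from the test string, one per element.
words <- c("wasn", "wasnt", "wasn't")

# perl = TRUE is needed because the default regex engine has no lookarounds.
fixed <- gsub("wasn(?!')t*", "wasn't", words, perl = TRUE)
print(fixed)
```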

Or you can do:

gsub("wasn'*t*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

For the second desired output:

gsub("wasn'*t*[.]?","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

After the edit:

gsub("wasn[^. ]*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
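
One thing to keep in mind with this pattern: the class [^. ] only stops at a dot or a space, so any other trailing punctuation is swallowed. A rough sketch with made-up inputs (x, res and res_wider are illustrative names), including a wider class as one possible adjustment:

```r
x <- c("it wasn45'e meant to be.", "You say wasnmmmt?", "he wasn't, I think")

# Same pattern as above: stops only at "." or " ".
res <- gsub("wasn[^. ]*", "wasn't", x)
print(res)

# Possible variant: also stop at ",", "?" and "!".
res_wider <- gsub("wasn[^.,?! ]*", "wasn't", x)
print(res_wider)
```

With the first pattern the question mark and the comma are dropped; with the wider class they are kept.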

I suggest a solution like this:

test <- c("i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple", "Wasn&^$tt that nice?", "You say wasnmmmt?", "No, he wasn&#t#@$.", "She wasn%#@t##, I know.")
gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
[2] "Wasn't that nice?"
[3] "You say wasn't?"
[4] "No, he wasn't."
[5] "She wasn't, I know."

See the online R demo.

This solution handles cases where wasn* appears at the start of the string or is capitalized, and it does not replace the trailing punctuation.

Pattern details

  • \\b - a word boundary
  • (wasn) - Capturing group 1 (referred to as \\1 in the replacement pattern): the substring wasn (case-insensitive due to ignore.case=TRUE)
  • \\S*\\b - 0 or more non-whitespace characters, followed by a word boundary
  • (?:\\S*(\\p{P})\\B)? - an optional non-capturing group, matching 1 or 0 occurrences of
    • \\S* - 0 or more non-whitespace characters
    • (\\p{P}) - Capturing group 2 (referred to as \\2 in the replacement pattern): any single punctuation character (not a symbol! \\p{P} is not equal to [:punct:]!) that is not followed by...
    • \\B - ...a letter, digit or _ (this is the non-word-boundary pattern).
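
To make the breakdown above concrete, here is a small sketch that prints the full match and both capture groups. It assumes an R version where regexec() accepts perl = TRUE; the names pattern, groups, unicode_punct and posix_punct are only for illustration.

```r
# Pattern copied from the answer above.
pattern <- "\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?"
x <- c("She wasn%#@t##, I know.", "You say wasnmmmt?")

# regexec() returns the full match followed by each capture group.
groups <- regmatches(x, regexec(pattern, x, ignore.case = TRUE, perl = TRUE))
print(groups)

# \p{P} covers punctuation only, while [[:punct:]] also matches ASCII symbols.
unicode_punct <- grepl("\\p{P}", c(",", "$"), perl = TRUE)
posix_punct <- grepl("[[:punct:]]", c(",", "$"), perl = TRUE)
print(unicode_punct)
print(posix_punct)
```

Group 1 keeps the original capitalization of wasn, and group 2 holds the trailing punctuation that gets put back by \\2.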

For even messier strings (like She wasn%#@t##,$#^ I know.), where the punctuation you want to keep can sit among other punctuation symbols, you can restrict the punctuation to stop at with a custom bracket expression and add \\S* at the end:

gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
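
As a quick check, here is that pattern applied to the messy example string. This is just a sketch; messy and cleaned are illustrative names.

```r
messy <- "She wasn%#@t##,$#^ I know."

# Group 2 keeps the last ?, !, ., ,, : or ; found in the junk after the word.
cleaned <- gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2",
                messy, ignore.case = TRUE, perl = TRUE)
print(cleaned)
```

The comma survives because it is the only character from the bracket expression in the run of non-whitespace characters after wasn.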

See the regex demo.

Why not keep it simple and replace any word that starts with wasn with wasn't?

test2 <- paste0(
  "i really wasn aware and i wasnt aware at all. but i wasn't aware. just ",
  "wasn't. this wasn45'e meant to be. it wasn@'re simple"
)
gsub("wasn[^ ]*", "wasn't", test2)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't this wasn't meant to be. it wasn't simple"

If you also need to handle upper case, you can add ignore.case = TRUE to gsub().
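
A short sketch of what that looks like, with made-up inputs. Note that a literal replacement always comes out lower-case; wrapping wasn in a capture group and reusing it as \\1 is one possible way to keep the original capitalization (this variant is my addition, not part of the answer above).

```r
x <- c("Wasn45 that nice?", "it wasnt me")

# Case-insensitive match, literal replacement: the capital W is lost.
lowered <- gsub("wasn[^ ]*", "wasn't", x, ignore.case = TRUE)
print(lowered)

# Reusing the matched prefix via \\1 keeps the original case.
kept <- gsub("(wasn)[^ ]*", "\\1't", x, ignore.case = TRUE)
print(kept)
```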
