
Replace the whole word that starts with a pattern using gsub in R

I'm having trouble with a problem that should be simple to solve. I'd like to replace every whole word in a string that starts with a given pattern.

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."

    ## this is what i want
    > output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

The best I've come up with so far is this:

# this is what I get, but it's not correct
> gsub("\\<wasn*.\\>", "wasn't", test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't't aware. just wasn't't."

I'm really running out of ideas. I would also be happy with

 # second desired output without the . at the end
    > output
    [1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

Edit: It seems my question was a bit too specific, so I'm adding other test cases. Basically, I don't know in advance which character(s) will follow "wasn", and I would like to convert every such word to wasn't.

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
> test
[1] "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"

#desired output
> output
 [1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"

You can use a negative lookahead, which is available when perl = TRUE. The pattern is wasn(?!')t*:

gsub("wasn(?!')t*", "wasn't", test, perl = TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

or you can do:

gsub("wasn'*t*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

For the second desired output:

gsub("wasn'*t*[.]?","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

AFTER THE EDIT:

gsub("wasn[^. ]*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"

I suggest a solution like this:

test <- c("i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple", "Wasn&^$tt that nice?", "You say wasnmmmt?", "No, he wasn&#t#@$.", "She wasn%#@t##, I know.")
gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
[2] "Wasn't that nice?"                                                                                                          
[3] "You say wasn't?"                                                                                                            
[4] "No, he wasn't."                                                                                                             
[5] "She wasn't, I know." 


This solution accounts for cases when wasn* appears at the start of the string or is capitalized, and does not replace the trailing punctuation.
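To make the point about trailing punctuation concrete, here is a small sketch that runs the simpler character-class pattern and the boundary-aware pattern on one of the test strings above. The variable names (x, simple, careful) are only illustrative.

```r
x <- "She wasn%#@t##, I know."

# Simple pattern: everything up to the next space or period is consumed,
# so the comma disappears together with the junk characters.
simple <- gsub("wasn[^. ]*", "wasn't", x)

# Boundary-aware pattern: the trailing comma is captured in group 2
# and restored by \\2 in the replacement.
careful <- gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", x,
                ignore.case = TRUE, perl = TRUE)

print(simple)   # "She wasn't I know."
print(careful)  # "She wasn't, I know."
```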

Pattern details

  • \\b - a word boundary
  • (wasn) - Capturing group 1 (referred to with \\1 in the replacement pattern): the substring wasn (case-insensitive because of ignore.case=TRUE)
  • \\S*\\b - zero or more non-whitespace characters, followed by a word boundary
  • (?:\\S*(\\p{P})\\B)? - an optional non-capturing group, matching one or zero occurrences of
    • \\S* - zero or more non-whitespace characters
    • (\\p{P}) - Capturing group 2 (referred to with \\2 in the replacement pattern): any single punctuation character (not a symbol: \\p{P} is not the same as [:punct:])
    • \\B - a non-word boundary; right after a punctuation character it requires that the next character is not a letter, digit or _
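The difference between \\p{P} and [:punct:] mentioned in the list can be checked directly. The following is a minimal sketch, assuming plain ASCII input; the vector of sample characters is chosen only for illustration.

```r
chars <- c(",", ".", "?", "$", "^", "+")

# Unicode punctuation property (needs perl = TRUE): "$", "^" and "+"
# are classified as symbols, not punctuation.
unicode_P <- grepl("\\p{P}", chars, perl = TRUE)

# POSIX class: all six ASCII characters count as punctuation.
posix_punct <- grepl("[[:punct:]]", chars)

print(data.frame(char = chars, unicode_P = unicode_P, posix_punct = posix_punct))
```

This is why a stray $ or ^ after the word is never captured as group 2 by the pattern above.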

For even messier strings (like She wasn%#@t##,$#^ I know.), where the punctuation you want to keep is surrounded by other punctuation or symbols, you can restrict the punctuation to stop at with a custom bracket expression and add a \\S* at the end to consume whatever follows it:

gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
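As a quick check, here is the same call applied to the messier example string from the paragraph above. The variable names messy and fixed are only illustrative.

```r
messy <- "She wasn%#@t##,$#^ I know."

# [?!.,:;] picks the comma as the punctuation to keep; the final \\S*
# swallows the "$#^" that follows it.
fixed <- gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", messy,
              ignore.case = TRUE, perl = TRUE)

print(fixed)  # "She wasn't, I know."
```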


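Why not keep it simple and replace any word that starts with wasn with wasn't?

# note the trailing space after "just", so the two pieces join correctly
test2 <- paste0(
  "i really wasn aware and i wasnt aware at all. but i wasn't aware. just ",
  "wasn't. this wasn45'e meant to be. it wasn@'re simple"
)
# [^ ]* consumes everything up to the next space, including a trailing period
gsub("wasn[^ ]*", "wasn't", test2)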
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't this wasn't meant to be. it wasn't simple"

If you also need to handle upper-case input, you can add ignore.case = TRUE to gsub().
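One thing to watch for with ignore.case = TRUE: the replacement string is literal, so a capitalized "Wasn" comes back in lower case. A small sketch, assuming you want to keep the original capitalization, borrows the capture-group idea from the other answer; the sample string and variable names are made up for illustration.

```r
x <- "Wasnt that nice? i wasnt sure"

# Literal replacement: the match is found case-insensitively,
# but "Wasnt" is rewritten with a lower-case w.
literal <- gsub("wasn[^ ]*", "wasn't", x, ignore.case = TRUE)

# Capturing the matched prefix and reusing it with \\1 keeps the case.
keep_case <- gsub("(wasn)[^ ]*", "\\1't", x, ignore.case = TRUE)

print(literal)    # "wasn't that nice? i wasn't sure"
print(keep_case)  # "Wasn't that nice? i wasn't sure"
```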

