
Replace the whole word that starts with a pattern using gsub in R

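The problem I'm having should be easy to solve. I want to replace every whole word that starts with a given pattern with a fixed string.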

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."

## this is what I want
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

The best I have come up with so far is

# this is what I get, but it's not correct
> gsub("\\<wasn*.\\>", "wasn't", test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't't aware. just wasn't't."

I'm really running out of ideas. I would also be happy with

# second desired output, without the . at the end
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

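Edit: It seems my question was a bit too specific, so I'm adding another test case. Basically, I don't know which characters will follow "wasn", and I want to convert all of these variants to wasn't.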

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
> test
[1] "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"

# desired output
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"

You can use the negative lookahead that perl-style regular expressions provide, with pattern=wasn(?!')t*

gsub("wasn(?!')t*", "wasn't", test, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
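To see what the negative lookahead actually matches, the hits can be listed with regmatches() and gregexpr(). This is only a minimal sketch; the string `words` is a made-up sample, not part of the original question.

```r
# Hypothetical sample containing the three spellings from the question
words <- "wasn wasnt wasn't"

# (?!') rejects any "wasn" that is immediately followed by an apostrophe,
# so the already-correct "wasn't" is never matched and never rewritten
hits <- regmatches(words, gregexpr("wasn(?!')t*", words, perl = TRUE))[[1]]
print(hits)
```

Only the two misspelled forms are matched, which is why the replacement does not produce "wasn't't".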

Or you can do the following:

gsub("wasn'*t*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

For the second desired output:

gsub("wasn'*t*[.]?","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

After the edit:

gsub("wasn[^. ]*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
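A quick comparison may help show why the character class is needed once arbitrary characters can follow "wasn". The sketch below uses a fragment of the extended test string; the variable names `extended`, `narrow` and `broad` are chosen here for illustration.

```r
# Fragment of the extended test string from the question
extended <- "this wasn45'e meant to be. it wasn@'re simple"

# The earlier pattern only consumes optional apostrophes and t's,
# so the unexpected characters after "wasn" are left behind
narrow <- gsub("wasn'*t*", "wasn't", extended)

# The negated character class consumes everything up to the next
# space or period, so the whole malformed word is replaced
broad <- gsub("wasn[^. ]*", "wasn't", extended)

print(narrow)
print(broad)
```

Note that `wasn[^. ]*` is not anchored to a word boundary, so it would also match "wasn" inside a longer word.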

I suggest a solution like this:

test <- c("i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple", "Wasn&^$tt that nice?", "You say wasnmmmt?", "No, he wasn&#t#@$.", "She wasn%#@t##, I know.")
gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
[2] "Wasn't that nice?"                                                                                                          
[3] "You say wasn't?"                                                                                                            
[4] "No, he wasn't."                                                                                                             
[5] "She wasn't, I know." 

See the online R demo.

This solution handles cases where wasn* appears at the start of the string or is capitalized, and it does not replace trailing punctuation.

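Pattern details

  • \\b - a word boundary
  • (wasn) - capturing group 1 (referenced later as \\1 in the replacement pattern): the substring wasn (case-insensitive because of ignore.case=TRUE)
  • \\S*\\b - zero or more non-whitespace characters, followed by a word boundary
  • (?:\\S*(\\p{P})\\B)? - an optional non-capturing group, matching 1 or 0 occurrences of:
    • \\S* - zero or more non-whitespace characters
    • (\\p{P}) - capturing group 2 (referenced later as \\2 in the replacement pattern): any single punctuation character (not a symbol! \\p{P} is not the same as [:punct:]!) that is not followed by...
    • \\B - ...a letter, digit, or _ (this is a non-word-boundary pattern).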
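The remark that \\p{P} is not the same as [:punct:] can be checked directly. The sketch below assumes R's PCRE build has Unicode property support (which the answer's own pattern also relies on); the vector `chars` is an illustrative sample.

```r
# Two punctuation characters and two ASCII symbols
chars <- c(",", "#", "$", "^")

# \p{P} covers Unicode punctuation only; "$" and "^" are classed as symbols
is_p <- grepl("\\p{P}", chars, perl = TRUE)

# The POSIX class [[:punct:]] also covers these ASCII symbols
is_punct <- grepl("[[:punct:]]", chars, perl = TRUE)

print(data.frame(chars, is_p, is_punct))
```

This is why a trailing "$" or "^" is never captured by group 2 in the pattern above.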

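For even messier strings (for example, She wasn%#@t##,$#^ I know.), where punctuation may sit inside other punctuation, you can use a custom bracket expression to restrict which punctuation marks to stop at, and add \\S* at the end: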

gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)

See the regex demo.
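As a sketch of how the bracket-expression variant behaves, it can be run on the messy example sentence mentioned above. The variable names `messy` and `cleaned` are introduced here for illustration only.

```r
# The messier example sentence from the answer
messy <- "She wasn%#@t##,$#^ I know."

# Group 2 keeps the one sentence punctuation mark found in the junk (the
# comma); the trailing \\S* then swallows whatever non-whitespace follows it
cleaned <- gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2",
                messy, ignore.case = TRUE, perl = TRUE)
print(cleaned)
```

Restricting group 2 to [?!.,:;] means that characters such as # are discarded rather than being kept as the trailing punctuation.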

Why not keep it simple and replace every word that starts with wasn with wasn't?

test2 <- paste0(
  "i really wasn aware and i wasnt aware at all. but i wasn't aware. just ",
  "wasn't. this wasn45'e meant to be. it wasn@'re simple"
)
gsub("wasn[^ ]*", "wasn't", test2)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't this wasn't meant to be. it wasn't simple"

If you also want to handle uppercase letters, you can add ignore.case = TRUE to the gsub() call.
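One thing to keep in mind with ignore.case = TRUE is that a fixed replacement string lowercases every match. A possible workaround, sketched below, is to capture the matched prefix and reuse it with \\1, as the other answer does; the string `mixed` is a made-up example.

```r
# Hypothetical input with a capitalized variant
mixed <- "Wasnt that nice? it wasn45'e simple"

# Fixed replacement: every match becomes lowercase "wasn't"
lowered <- gsub("wasn[^ ]*", "wasn't", mixed, ignore.case = TRUE)

# Capturing the prefix and reusing it via \\1 keeps its original case
kept <- gsub("(wasn)[^ ]*", "\\1't", mixed, ignore.case = TRUE)

print(lowered)
print(kept)
```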
