
Removing all punctuation apart from single apostrophes and hyphens within words

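I asked a similar question earlier, but this one is more specific and needs a different solution from the one given there, so I hope it is OK to post it. I need to keep only apostrophes and intra-word dashes in my text (removing all other punctuation). For example, I want to get str2 from str1: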

str1<-"I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2<-"I'm dash before word word dash in-between word two before word  word just dashes  between words word  word"

My solution so far first removes the dashes that stand between words:
gsub(" - ", " ", str1)

and then keeps alphanumeric characters, apostrophes and the remaining dashes:
gsub("[^[:alnum:]['-]", " ", str1)

The problem is that this does not remove consecutive dashes such as "--", or dashes at the beginning or end of a word: "-word" or "word-".
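To see the problem concretely, here is a minimal sketch that chains the two steps. It assumes the second call was meant to run on the result of the first, and it writes the character class as [^[:alnum:]'-], without the extra [ that appears in the attempt above. The names step1 and step2 are only illustrative.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

# Step 1: drop a dash that stands alone between two spaces.
step1 <- gsub(" - ", " ", str1)

# Step 2: replace everything except letters, digits, apostrophes and hyphens
# with a space. The hyphen comes last in the class so it is taken literally.
step2 <- gsub("[^[:alnum:]'-]", " ", step1)

# Dashes at the start of a word and runs of dashes are still there.
grepl(" -word", step2, fixed = TRUE)  # TRUE
grepl("--", step2, fixed = TRUE)      # TRUE
```

Both checks return TRUE, which is exactly the leftover punctuation the question is about.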

I think this does it:

gsub('( |^)-+|-+( |$)', '\\1', gsub("[^ [:alnum:]'-]", '', str1))
#[1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"
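The nested call can be easier to follow when it is split into its two steps. The sketch below is only an unpacked version of the same code; the variable names are illustrative. It also shows one edge case that str1 does not exercise: in the second branch the space after the dashes is captured by group 2, which the replacement "\\1" never writes back.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

# Inner call: keep only spaces, letters, digits, apostrophes and hyphens.
kept <- gsub("[^ [:alnum:]'-]", "", str1)

# Outer call: remove runs of hyphens that touch a space or an end of the string.
# "( |^)-+" handles hyphens at the start of a word; the captured space is
# written back through "\\1".
# "-+( |$)" handles hyphens at the end of a word; group 1 is empty here, so
# the hyphens and the space after them are both dropped.
cleaned <- gsub("( |^)-+|-+( |$)", "\\1", kept)

# Edge case: a trailing hyphen followed by a space glues two words together.
glued <- gsub("( |^)-+|-+( |$)", "\\1", "word- next")         # "wordnext"

# Possible adjustment (an assumption, not part of the answer): also write
# back group 2 so the space survives.
separated <- gsub("( |^)-+|-+( |$)", "\\1\\2", "word- next")  # "word next"
```

For str1 itself both replacements give the same result, because every run of dashes there is preceded by a space.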

Here is one way:

gsub("([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]", "\\1", str1)
# [1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"

Or, more explicitly:

gsub("([[:alnum:]]['-][[:alnum:]])|[[:punct:]]", "\\1", str1)

Same thing, slightly different/shorter:

gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", str1, perl=TRUE)
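As a quick check, the sketch below runs the three patterns side by side. On str1 they agree, but they differ on other input: the first keeps any punctuation character between two alphanumerics, while the other two keep only ' and -. The test strings "a.b a-b" and "rock-n-roll" are made up for illustration.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

any_punct <- "([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]"
only_two  <- "([[:alnum:]]['-][[:alnum:]])|[[:punct:]]"
perl_w    <- "(\\w['-]\\w)|[[:punct:]]"

r1 <- gsub(any_punct, "\\1", str1)
r2 <- gsub(only_two, "\\1", str1)
r3 <- gsub(perl_w, "\\1", str1, perl = TRUE)
identical(r1, r2) && identical(r2, r3)  # TRUE

# The first pattern also keeps a dot between two letters.
gsub(any_punct, "\\1", "a.b a-b")  # "a.b a-b"
gsub(only_two, "\\1", "a.b a-b")   # "ab a-b"

# The group consumes the characters on both sides, so a single character
# between two hyphens loses its second hyphen.
gsub(only_two, "\\1", "rock-n-roll")  # "rock-nroll"
```

The last line is worth knowing about: because each match consumes the characters around the hyphen, overlapping cases such as "rock-n-roll" are not fully preserved by these patterns.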

I suggest

x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
# =>  "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"

The regex is

\b([-'])\b|[[:punct:]]+

Details:

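  • \\b([-'])\\b - a - or ' enclosed by word characters (letters, digits or _); it is captured in group 1 and written back through \\1 (note: to keep it only between letters, use (?<=\\p{L})([-'])(?=\\p{L}) instead)
  • | - or
  • [[:punct:]]+ - 1 or more punctuation characters; group 1 is empty in this branch, so they are removed.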
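The difference between the word-boundary version and the letters-only version from the note can be sketched as follows. The input string is made up for illustration, and the second pattern assumes R's PCRE build supports Unicode properties such as \p{L}.

```r
x <- "top-10 it's 3-4"

# Word-boundary version: a hyphen or apostrophe is kept between any two
# word characters, digits included.
by_boundary <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl = TRUE)

# Letters-only version: it is kept only with a letter on both sides.
by_letters <- gsub("(?<=\\p{L})([-'])(?=\\p{L})|[[:punct:]]+", "\\1", x, perl = TRUE)

by_boundary  # "top-10 it's 3-4"
by_letters   # "top10 it's 34"
```

Which one fits depends on whether tokens such as "top-10" should count as words in your text.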

To remove any leading/trailing and double whitespace left after this replacement, you can use

res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
res <- trimws(gsub("\\s{2,}", " ", res))
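If the cleanup is needed in more than one place, the two steps can be wrapped in a small helper. This is only a sketch; the function name clean_punct is hypothetical and not part of the answer.

```r
# Hypothetical helper: strip punctuation except intra-word ' and -,
# then normalise whitespace.
clean_punct <- function(text) {
  kept <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", text, perl = TRUE)
  trimws(gsub("\\s{2,}", " ", kept))
}

x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

clean_punct(x)
# "I'm dash before word word dash in-between word two before word word just dashes between words word word"

clean_punct(" --hello,  world-- ")
# "hello world"
```

Since gsub and trimws are vectorised, the helper can also be applied to a whole character vector at once.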


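Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.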

 