R regex to replace all punctuation except sentence markers, apostrophes and hyphens
Removing all punctuation apart from single apostrophes and hyphens within words
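I asked a similar question before, but this one is more specific and needs a different solution from the one I was given, so I hope it is all right to post it. I need to keep only apostrophes and intra-word hyphens in my text and delete all other punctuation. For example, I want to get str2 from str1: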
str1<-"I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2<-"I'm dash before word word dash in-between word two before word word just dashes between words word word"
My solution so far first removes the dashes that stand between words:
gsub(" - ", " ", str1)
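and then keeps the alphanumeric characters plus the remaining dashes:
gsub("[^[:alnum:]['-]", " ", str1)
The problem is that it does not delete runs of consecutive dashes such as "--", nor dashes at the beginning or end of a word: "-word" or "word-".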
I think this does it:
gsub('( |^)-+|-+( |$)', '\\1', gsub("[^ [:alnum:]'-]", '', str1))
#[1] "I'm dash before word word dash in-between word two before word word just dashes between words word word"
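The removed punctuation can leave runs of spaces behind, so a quick way to check this two-pass answer against the target string is to collapse the whitespace first. The snippet below is a minimal sketch of that check; the names res and res_clean are only illustrative and do not come from the thread.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2 <- "I'm dash before word word dash in-between word two before word word just dashes between words word word"

# Pass 1 (inner gsub): drop everything except spaces, alphanumerics, ' and -
# Pass 2 (outer gsub): drop runs of dashes that touch a space or a string edge
res <- gsub("( |^)-+|-+( |$)", "\\1", gsub("[^ [:alnum:]'-]", "", str1))

# Collapse the repeated spaces left where punctuation used to be
res_clean <- trimws(gsub(" {2,}", " ", res))

identical(res_clean, str2)
```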
Here is one approach:
gsub("([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]", "\\1", str1)
# [1] "I'm dash before word word dash in-between word two before word word just dashes between words word word"
Or, more explicitly:
gsub("([[:alnum:]]['-][[:alnum:]])|[[:punct:]]", "\\1", str1)
The same thing, slightly different and shorter:
gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", str1, perl=TRUE)
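The three variants agree on the sample string but are not interchangeable in general. The sketch below compares them and probes two edge cases; the variable names and the extra test strings ("a,b x.y" and "a-b-c") are assumptions made for illustration and are not part of the original question.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

p_any  <- "([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]"  # keeps any punctuation between alphanumerics
p_only <- "([[:alnum:]]['-][[:alnum:]])|[[:punct:]]"         # keeps only ' and - between alphanumerics
p_perl <- "(\\w['-]\\w)|[[:punct:]]"                         # same idea with \w (needs perl = TRUE)

v1 <- gsub(p_any,  "\\1", str1)
v2 <- gsub(p_only, "\\1", str1)
v3 <- gsub(p_perl, "\\1", str1, perl = TRUE)

# All three give the same result on the sample input
same_on_sample <- identical(v1, v2) && identical(v2, v3)

# The first pattern also keeps commas and periods between alphanumerics
kept_any  <- gsub(p_any,  "\\1", "a,b x.y")
kept_only <- gsub(p_only, "\\1", "a,b x.y")

# The capture group consumes the character after the hyphen, so a second
# hyphen right after a one-character segment is treated as plain punctuation
chained <- gsub(p_perl, "\\1", "a-b-c", perl = TRUE)
```

If chains of single-character segments matter for your data, a pattern that does not consume the neighbouring characters (word boundaries or lookarounds) avoids that last effect.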
I suggest
x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
# => "I'm dash before word word dash in-between word two before word word just dashes between words word word"
See the R demo. The regex is
\b([-'])\b|[[:punct:]]+
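See the regex demo. Details:
\\b([-'])\\b - a hyphen (-) or an apostrophe (') enclosed by word characters (letters, digits or _). Note: if you want to keep it only between letters, use (?<=\\p{L})([-'])(?=\\p{L}) instead.
| - or
[[:punct:]]+ - one or more punctuation characters.
To remove any leading/trailing and double whitespace characters that result from this replacement, you can use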
res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
res <- trimws(gsub("\\s{2,}", " ", res))
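To see how the word-boundary pattern and the letters-only lookaround alternative differ, the two steps can be wrapped in a small helper. This is a sketch only: the function name clean_punct, its letters_only argument and the sample sentence are hypothetical and do not appear in the answer.

```r
# Hypothetical helper combining the replacement and the whitespace cleanup
clean_punct <- function(x, letters_only = FALSE) {
  pattern <- if (letters_only) {
    # keep - or ' only when there is a letter on both sides
    "(?<=\\p{L})([-'])(?=\\p{L})|[[:punct:]]+"
  } else {
    # keep - or ' when there is a word character (letter, digit, _) on both sides
    "\\b([-'])\\b|[[:punct:]]+"
  }
  res <- gsub(pattern, "\\1", x, perl = TRUE)
  # collapse repeated whitespace and trim both ends
  trimws(gsub("\\s{2,}", " ", res))
}

x <- "it's 3-4 o'clock, well-known -- fact!"
by_word_chars <- clean_punct(x)                       # hyphen in "3-4" is kept
by_letters    <- clean_punct(x, letters_only = TRUE)  # hyphen in "3-4" is removed
```

Neither pattern consumes the characters around the hyphen or apostrophe, so chained forms such as "a-b-c" keep both hyphens.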