Removing punctuation except for apostrophes AND intra-word dashes in R

I know how to remove punctuation while keeping apostrophes:

gsub( "[^[:alnum:]']", " ", db$text )  

or how to keep intra-word dashes with the tm package:

removePunctuation(db$text, preserve_intra_word_dashes = TRUE)

but I cannot find a way to do both at the same time. For example, if my original sentence is:

"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

I would like it to be:

"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

Of course, there will be extra white spaces, but I can remove them later.

I will be grateful for your help.

Use a negated character class that also lists the apostrophe and the dash (the dash goes last so that it is read literally):

gsub("[^[:alnum:]'-]", " ", db$text)

## "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
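The following is a minimal base-R sketch of the whole clean-up, not part of the original answer. It applies the character-class substitution to the sentence from the question and then collapses the leftover whitespace. The variable names (text, kept, strict, cleaned) are chosen for illustration only. The middle step is an optional, stricter variant that assumes an "intra-word" dash means a dash with an alphanumeric character on both sides. It relies on perl = TRUE lookarounds, which by default only treat ASCII letters and digits as alphanumeric.

```r
text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

# Step 1: replace everything except letters, digits, apostrophes and dashes
# with a space. The dash is last in the bracket expression, so it is literal.
kept <- gsub("[^[:alnum:]'-]", " ", text)

# Step 2 (optional, an assumption about what "intra-word" should mean):
# drop any dash that does not have an alphanumeric on both sides.
strict <- gsub("(?<![[:alnum:]])-|-(?![[:alnum:]])", " ", kept, perl = TRUE)

# Step 3: collapse runs of whitespace and trim both ends.
cleaned <- trimws(gsub("[[:space:]]+", " ", strict))
cleaned
```

For the example sentence, step 2 changes nothing because the only dash is the one in e-board. It matters for input such as "a - b e-board -x y-", where only the dash inside e-board survives.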

I like David Arenberg's answer. If you need another way, you could try:

library(qdap)

text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

gsub("/", " ", strip(text, char.keep = c("-", "/"), apostrophe.remove = FALSE, lower.case = FALSE))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

or

library(gsubfn)
clean(gsubfn("[[:punct:]]", function(x) ifelse(x == "'", "'", ifelse(x == "-", "-", " ")), text))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

clean is from qdap. It is used to remove escaped characters and extra whitespace.
