Removing punctuation except for apostrophes AND intra-word dashes in R
I know how to remove punctuation while keeping apostrophes:
gsub( "[^[:alnum:]']", " ", db$text )
or how to keep intra-word dashes with the tm package:
removePunctuation(db$text, preserve_intra_word_dashes = TRUE)
but I cannot find a way to do both at the same time. For example, if my original sentence is:
"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
I would like it to be:
"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
Of course, there will be extra white spaces, but I can remove them later.
I will be grateful for your help.
Use a negated character class that lists the characters to keep:
gsub("[^[:alnum:]'-]", " ", db$text)
## (runs of spaces collapsed for display)
## "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
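The character class above keeps every dash, including ones that stand alone or trail a word, and it leaves runs of spaces behind. A minimal base R sketch that keeps only dashes with an alphanumeric character on both sides, and then collapses whitespace, might look like this. The function name `clean_text` is hypothetical, and treating "intra-word" as "alphanumeric on both sides" is an assumption:

```r
# Hypothetical helper (the name is illustrative): keep letters, digits,
# apostrophes and intra-word dashes; turn everything else into single spaces.
clean_text <- function(x) {
  # 1. Replace every character that is not alphanumeric, an apostrophe or a dash.
  x <- gsub("[^[:alnum:]'-]", " ", x)
  # 2. Replace dashes that lack an alphanumeric neighbour on either side
  #    (assumed meaning of "not intra-word"); lookarounds need perl = TRUE.
  x <- gsub("(?<![[:alnum:]])-|-(?![[:alnum:]])", " ", x, perl = TRUE)
  # 3. Collapse runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

clean_text("Congrats to our new e-board! Well - done, y'all -- #swag")
# [1] "Congrats to our new e-board Well done y'all swag"
```

Applied to a data frame column, this would be `db$text <- clean_text(db$text)`, since `gsub` and `trimws` are vectorised.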
I like David Arenberg's answer. If you need another way, you could try:
library(qdap)
text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
# keep "-" and "/" through strip(), then turn each "/" into a space
gsub("/", " ", strip(text, char.keep = c("-", "/"), apostrophe.remove = FALSE, lower.case = FALSE))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
or
library(gsubfn)
clean(gsubfn("[[:punct:]]", function(x) ifelse(x=="'","'",ifelse(x=="-","-"," ")),text))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
clean is from qdap. It is used to remove escaped characters and extra spaces.
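If you would rather avoid extra packages, roughly the same effect as the gsubfn call can be sketched in base R with a PCRE lookahead that exempts apostrophes and dashes from `[[:punct:]]`. This is an alternative technique, not what the answer above uses, and the function name `strip_punct` is hypothetical:

```r
# Hypothetical base R helper: replace punctuation with spaces,
# except apostrophes and dashes.
strip_punct <- function(x) {
  # (?!['-]) rejects a match when the next character is ' or -,
  # so [[:punct:]] only consumes the remaining punctuation.
  x <- gsub("(?!['-])[[:punct:]]", " ", x, perl = TRUE)
  # Collapse the leftover runs of spaces, standing in for qdap's clean().
  trimws(gsub("\\s+", " ", x))
}

text <- "Interested in energy/the environment/etc.? Congrats to our new e-board!"
strip_punct(text)
# [1] "Interested in energy the environment etc Congrats to our new e-board"
```

Like the gsubfn version, this keeps every dash, so a free-standing " - " survives.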