Remove all punctuation except apostrophes in R
I want to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regex, but I'm learning.
Example:
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))
Current output (the apostrophe in "don't" is missing):
[1] "I like to chew gum but dont like bubble gum"
Desired output (I want the apostrophe in "don't" to stay):
[1] "I like to chew gum but don't like bubble gum"
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)
[1] "I like to chew gum but don't like bubble gum"
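The regex above is more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe (note the caret!) with an empty string.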
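As a minimal sketch of the whitelist idea described above, the pattern can be wrapped in a small helper. The function name `keep_apostrophes` and the second test string are made up for illustration and are not part of the original answer.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Hypothetical helper: delete every character that is NOT alphanumeric,
# whitespace, or an apostrophe (the leading ^ negates the bracket expression).
keep_apostrophes <- function(text) {
  gsub("[^[:alnum:][:space:]']", "", text)
}

result <- keep_apostrophes(x)
print(result)
# [1] "I like to chew gum but don't like bubble gum"

# Hyphens are not on the whitelist, so they are removed as well.
result2 <- keep_apostrophes("rock-n-roll, y'all!")
print(result2)
# [1] "rocknroll y'all"
```

Because this is a whitelist, keeping another character (a hyphen, say) should only require adding it to the bracket expression.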
You can exclude the apostrophe from the POSIX class punct with a double negative:
[^'[:^punct:]]
Code:
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl = TRUE)
#[1] "I like to chew gum but don't like bubble gum"
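The double negative reads as "match anything that is neither an apostrophe nor a non-punctuation character", which leaves exactly the punctuation other than the apostrophe. A different technique that may be easier to read, shown here only as a sketch and not as this answer's method, is a negative lookahead, which also requires perl = TRUE. The variable names below are illustrative.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Double negative from the answer: negated class containing the apostrophe
# and the negated POSIX class [:^punct:] (PCRE syntax).
double_negative <- gsub("[^'[:^punct:]]", "", x, perl = TRUE)

# Alternative sketch: (?!') asserts that the next character is not an
# apostrophe, then [[:punct:]] consumes one punctuation character.
lookahead <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)

print(double_negative)
print(lookahead)
print(identical(double_negative, lookahead))
```

For this input both calls should return the same string, with the apostrophe in "don't" preserved.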
Here is an example:
> gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"
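Mostly for variety, here is a solution using gsubfn() from the terrific package of the same name. In this application, I just like how expressive the solution it allows is: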
library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R",
replacement = function(x) ifelse(x == "'", "'", ""),
x)
[1] "I like to chew gum but don't like bubble gum"
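(The argument engine = "R" is needed here, as otherwise the default tcl engine will be used. Its rules for matching regular expressions are slightly different: for example, if it were used to process the string above, you would instead need to set pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out this detail.)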