R regex to replace all punctuation except sentence markers, apostrophes and hyphens

I am looking for a way to mark the start and end of sentences in R. For this purpose I would like to remove all punctuation except end-of-sentence markers (periods, exclamation marks, question marks) and hyphens, which I want to replace with the marker ***. At the same time, I also want to preserve words containing apostrophes. To give a concrete example, given this string:

txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

The desired outcome would be:

txt <- "We have examined all the possibilities however we have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"

I have not been able to come up with a regular expression that does this. Any hint is greatly appreciated.

You may use gsub.

> txt <- "We have examined all the possibilities, however he have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"
> gsub("[-.?!]", "<S>", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE))
[1] "We have examined all the possibilities however he have not reached a solid conclusion <S> however we keep and open mind<S> Have you considered any other approach<S> Haven't you<S>"
> gsub("[-.?!]", "***", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE))
[1] "We have examined all the possibilities however he have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"

I would like to eliminate all punctuation except for end of sentence markers such as periods, exclamation marks, question marks, and hyphens.

gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE)

which I want to substitute with the marker ***. At the same time, I also want to preserve words containing apostrophes.

gsub("[-.?!]", "***", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE))
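
The two calls can also be wrapped in a small helper so that the marker is configurable. This is only a sketch: the function name `mark_sentences` and its `marker` argument are hypothetical, not part of base R. It relies on `perl = TRUE` because the negative lookahead `(?!...)` is PCRE syntax.

```r
# Hypothetical helper; the name and arguments are illustrative only.
mark_sentences <- function(x, marker = "***") {
  # Step 1: drop every punctuation character except - . ? ! and '
  # The lookahead rejects those five characters before [[:punct:]] can match.
  kept <- gsub("(?![-.?!'])[[:punct:]]", "", x, perl = TRUE)
  # Step 2: replace the remaining sentence markers with the marker.
  # The replacement is not a regex, so "***" is inserted literally.
  gsub("[-.?!]", marker, kept)
}

txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"
mark_sentences(txt)
```

Note that every hyphen is treated as a marker here, so a hyphen inside a word such as well-known would be replaced as well.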

You can do this with two regexes. First, remove the characters you don't want by using a character class:

[,.]
  ^--- Whatever you want to remove, put it here

And use an empty replacement string.

Then you can use a second regex like this:

[?!-]
  ^--- Add characters you want to replace here

With the replacement string:

<S>

Working demo
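
In R, the two steps described above might look like the following sketch. It makes one assumption that differs from the character classes shown: the period is placed in the second class rather than the first, because the question treats periods as sentence markers to be replaced, not removed. The variable names `step1` and `step2` are illustrative.

```r
txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

# First regex: remove unwanted characters (only the comma in this example).
# The apostrophe is not in the class, so "Haven't" is left intact.
step1 <- gsub("[,]", "", txt)

# Second regex: replace the sentence markers with <S>.
# The hyphen is placed last in the class so it is read as a literal "-".
step2 <- gsub("[.?!-]", "<S>", step1)
step2
```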
