R regex to replace all punctuation except sentence markers, apostrophes and hyphens

I am looking for a way to mark the start and end of sentences in R. To do this, I want to remove all punctuation except the sentence-boundary markers (periods, exclamation marks, question marks, and hyphens), which I want to replace with the marker ***. I also want to keep apostrophes, so that words containing them are preserved. For example, given this string:

txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

The desired outcome would be

txt <- "We have examined all the possibilities however we have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"

I have not been able to come up with a regular expression that does this. Any hint is greatly appreciated.

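You can use two nested gsub calls: the inner call removes the unwanted punctuation, and the outer call replaces the sentence markers.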

> txt <- "We have examined all the possibilities, however he have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"
> gsub("[-.?!]", "<S>", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl = TRUE))
[1] "We have examined all the possibilities however he have not reached a solid conclusion <S> however we keep and open mind<S> Have you considered any other approach<S> Haven't you<S>"
> gsub("[-.?!]", "***", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl = TRUE))
[1] "We have examined all the possibilities however he have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"

I would like to eliminate all punctuation except for end of sentence markers such as periods, exclamation marks, interrogation marks, and hyphens.

gsub("(?![-.?!'])[[:punct:]]", "", txt, perl = TRUE)

which I want to substitute with the marker ***. At the same time, I also want to preserve words containing apostrophes.

gsub("[-.?!]", "***", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl = TRUE))
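
The two steps can also be run separately to inspect the intermediate result. The following is a minimal sketch using the same patterns as above; the variable names kept and marked are illustrative and not part of the original answer.

```r
txt <- "We have examined all the possibilities, however he have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

# Step 1: delete punctuation, skipping the characters listed in the lookahead.
# (?![-.?!']) fails wherever the next character is one of - . ? ! '
# so [[:punct:]] can only match the remaining punctuation (here, the comma).
# Lookaheads need the PCRE engine, hence perl = TRUE.
kept <- gsub("(?![-.?!'])[[:punct:]]", "", txt, perl = TRUE)

# Step 2: replace every remaining sentence marker with the literal text ***.
# The apostrophe is not in this class, so "Haven't" is left intact.
marked <- gsub("[-.?!]", "***", kept)

print(kept)
print(marked)
```

Writing perl = TRUE rather than perl = T is a small safety measure, since T is an ordinary variable that can be reassigned.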

You can do this with two regular expressions. First, remove the characters you don't want by using a character class:

[,;:]
  ^--- Whatever you want to remove, put it here

And use an empty replacement string.

Then use a second regex like this:

[.?!-]
  ^--- Add characters you want to replace here

With a replacement string:

<S>

Working demo
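
As a rough sketch of how this two-regex approach might look in R, the helper below applies both passes with gsub. The function name mark_sentences, its parameter names, and the particular character classes passed in are assumptions made for illustration; adjust the classes to the punctuation that occurs in your own text.

```r
# Hypothetical helper: remove one class of characters, then replace another.
mark_sentences <- function(x, remove_class, replace_class, marker) {
  # First pass: delete every character matched by remove_class.
  stripped <- gsub(remove_class, "", x)
  # Second pass: swap every character matched by replace_class for the marker.
  gsub(replace_class, marker, stripped)
}

txt <- "We have examined all the possibilities, however he have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

# Illustrative classes: drop commas, semicolons and colons; mark . ? ! and -
# The hyphen is placed last inside the brackets so it is read literally.
result <- mark_sentences(txt, "[,;:]", "[.?!-]", "<S>")
print(result)
```

Because the apostrophe appears in neither class, words such as Haven't pass through unchanged.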
