I am processing a lot of old text material. Quite often the OCR process puts a "." in the middle of a word, for example "t.h.i.s i.s a test." I want to replace these dots with an empty string "", but I do not want to get rid of the dots that mark the end of a sentence. So I am looking for a regular expression that finds letter/dot/letter and replaces only the dot with nothing.
test <- "t.h.i.s i.s a test."
gsub(test, pattern="\\w[[:punct:]]\\w", replacement="")
But this is the result:
". a test."
Any suggestions are appreciated.
You can do the opposite, i.e. extract every character that is not a dot, plus a dot at the very end of the string, and paste the pieces back together:
library(stringr)
test <- "t.h.i.s i.s a test."
paste0(str_extract_all(test, "[^\\.]|(\\.$)")[[1]], collapse = "")
[1] "this is a test."
If you want to allow for multiple sentences, and we can assume that a dot followed by a space marks the end of a sentence, then you can use:
test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence."
paste0(str_extract_all(test, "[^\\.]|(\\.$)|(\\. )")[[1]], collapse = "")
[1] "this is a test. With another sentence."
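The pattern in the question fails because `\\w[[:punct:]]\\w` consumes the letters on both sides of the dot, so they are deleted along with it. A minimal sketch of the letter/dot/letter idea, assuming PCRE lookarounds (`perl = TRUE`) are acceptable, checks the neighbouring letters without consuming them. The extra sentence with "5.5" is added here only for illustration:

```r
test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence. It costs 5.5 dollars."

# (?<=[[:alpha:]])  lookbehind: a letter must precede the dot
# \\.               the dot itself (the only character matched and removed)
# (?=[[:alpha:]])   lookahead: a letter must follow the dot
cleaned <- gsub("(?<=[[:alpha:]])\\.(?=[[:alpha:]])", "", test, perl = TRUE)
print(cleaned)
# [1] "this is a test. With another sentence. It costs 5.5 dollars."
```

Because only letters count as neighbours, decimals such as 5.5 are left alone. The sketch has limits: an abbreviation like "U.S." becomes "US.", and two sentences joined without a space ("end.Next") would lose their dot.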
Here is my best guess, and a suggestion on how to further enhance the pattern:
> test = "T.h.i.s is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."
> gsub("\\b((?:U[.]S|etc|M(?:r?s|r))[.]|\\d+[.]\\d+)|[.](?!$|\\s+\\p{Lu})", "\\1", test, perl=TRUE)
[1] "This is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."
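The list of exceptions in the first capture group is likely to grow when cleaning real OCR text. A minimal sketch, assuming a hypothetical vector `abbrevs` of regex fragments (the entries `Dr` and `Prof` are illustrative additions, not part of the pattern above), builds a pattern of the same shape programmatically:

```r
# Hypothetical exception list: each entry is a regex fragment WITHOUT its trailing dot.
abbrevs <- c("U[.]S", "etc", "Mrs?", "Ms", "Dr", "Prof")

# Group 1 captures what must be kept (abbreviation + dot, or a decimal number);
# the second alternative matches the dots that should be dropped.
pattern <- paste0(
  "\\b((?:", paste(abbrevs, collapse = "|"), ")[.]|\\d+[.]\\d+)",
  "|[.](?!$|\\s+\\p{Lu})"
)

ocr <- "Dr. Smith and Prof. Jones drove 5.5 m.i.l.e.s."
cleaned <- gsub(pattern, "\\1", ocr, perl = TRUE)
print(cleaned)
# [1] "Dr. Smith and Prof. Jones drove 5.5 miles."
```

Keeping the exceptions in a vector makes it easier to add entries as new abbreviations turn up in the corpus, at the cost of having to escape each entry as a regex by hand.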
Explanation:
- \\b((?:U[.]S|etc|M(?:r?s|r))[.]|\\d+[.]\\d+) - matches the exceptions, which are captured and restored with the \\1 backreference in the replacement. This part matches U.S., etc., Mr., Ms., Mrs. and digits+.digits, and can be extended with further alternatives
- | - or
- [.](?!$|\\s+\\p{Lu}) - matches a dot that is not followed by the end of the string, $, or by 1+ whitespace characters and an uppercase letter, \\s+\\p{Lu}. When only this alternative matches, group 1 is empty, so the dot is replaced with nothing
paste0(gsub('\\.', '', test), '.')
#[1] "this is a test."
To make this work with more sentences (it gets ugly), split on a dot followed by a space first:
paste(paste0(gsub('\\.', '', unlist(strsplit(test, '\\. '))), '.'), collapse = ' ')
#[1] "this is a test. With another sentence."
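Since `unlist()` merges everything into one result, the one-liner above handles a single string at a time. A minimal sketch for a whole character vector, using a hypothetical helper name `remove_inner_dots` and assuming every element ends with a sentence-final dot:

```r
# Hypothetical helper: applies the split / strip / re-join idea to each element.
remove_inner_dots <- function(x) {
  vapply(strsplit(x, "\\. "), function(parts) {
    # Strip every remaining dot, then give each sentence its final dot back.
    paste(paste0(gsub("\\.", "", parts), "."), collapse = " ")
  }, character(1))
}

docs <- c("t.h.i.s i.s a test.", "With a.n.other sen.t.ence. A.n.d more.")
cleaned <- remove_inner_dots(docs)
print(cleaned)
# [1] "this is a test."                 "With another sentence. And more."
```

Note that an element without a trailing dot would gain one, so this only suits text where the assumption holds.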