R: How to replace a dot between two characters in a string

Question

I am processing a lot of old text material. Quite often the OCR process puts a "." inside a word, for example "t.h.i.s i.s a test." I want to replace these dots with an empty string "". But I do not want to get rid of the dots that mark the end of a sentence. So I am looking for a regular expression that matches letter/dot/letter and then replaces the dot with nothing.

    test <- "t.h.i.s i.s a test." 
    gsub(test, pattern="\\w[[:punct:]]\\w", replacement="")

But this is the result

    ".  a test."

Any suggestions are appreciated.

Answer 1

You can do the opposite, i.e. extract everything in the sentence that is not a dot in the middle of the string:

    library(stringr)
    test <- "t.h.i.s i.s a test."
    paste0(str_extract_all(test, "[^\\.]|(\\.$)")[[1]], collapse = "")
    # [1] "this is a test."

If the text may contain several sentences, and we can assume that a dot followed by a space should be kept, then you can use:

    test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence."
    paste0(str_extract_all(test, "[^\\.]|(\\.$)|(\\. )")[[1]], collapse = "")
    # [1] "this is a test. With another sentence."
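
The same rule (keep a dot only at the end of the string or before a space) can also be written as a removal rather than an extraction. This is a minimal base R sketch, assuming PCRE lookaheads are enabled with perl = TRUE; the variable name cleaned is only illustrative:

```r
# Remove every dot that is NOT at the end of the string and NOT followed by a space.
# perl = TRUE is required because the negative lookahead (?!...) is PCRE syntax.
test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence."
cleaned <- gsub("\\.(?!$| )", "", test, perl = TRUE)
print(cleaned)
# [1] "this is a test. With another sentence."
```

Like the extraction version, this treats any dot followed by a space as a sentence end, so a stray dot that happens to sit before a space is kept.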

Answer 2

Here is my best guess, along with a suggestion on how to extend the pattern:

    test <- "T.h.i.s is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."
    gsub("\\b((?:U[.]S|etc|M(?:r?s|r))[.]|\\d+[.]\\d+)|[.](?!$|\\s+\\p{Lu})", "\\1", test, perl = TRUE)
    # [1] "This is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."

Explanation:

\\b((?:U[.]S|etc|M(?:r?s|r))[.]|\\d+[.]\\d+) - matches the exceptions, which are captured in group 1 and restored by the \\1 backreference in the replacement. This part matches U.S., etc., Mr., Ms., Mrs. and digits.digits numbers such as 5.5, and the list can be extended.
| - or
[.](?!$|\\s+\\p{Lu}) - matches a dot that is not followed by the end of the string ( $ ) or by one or more whitespace characters and an uppercase letter ( \\s+\\p{Lu} ). These dots are removed, because group 1 is empty for them.
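
One way the exception list might be extended is to build the protected part of the pattern from a vector. The sketch below is only an illustration: strip_stray_dots and its abbreviations argument are hypothetical names, not part of the answer, and each abbreviation is assumed to be a valid regex written without its trailing dot.

```r
# Hypothetical helper: builds the same "protect, then strip" pattern from a
# vector of abbreviation regexes, each written WITHOUT its trailing dot.
strip_stray_dots <- function(x, abbreviations = c("U[.]S", "etc", "Mrs?", "Ms")) {
  # Group 1: an abbreviation plus its dot, or a decimal number; restored via \\1.
  protected <- paste0("\\b((?:", paste(abbreviations, collapse = "|"),
                      ")[.]|\\d+[.]\\d+)")
  # A dot not followed by the end of the string or by whitespace + uppercase letter.
  stray <- "[.](?!$|\\s+\\p{Lu})"
  gsub(paste0(protected, "|", stray), "\\1", x, perl = TRUE)
}

strip_stray_dots("T.h.i.s is a U.S. state. I drove 5.5 miles.")
# [1] "This is a U.S. state. I drove 5.5 miles."

# Adding "p" protects the page abbreviation, whose dot would otherwise be removed
# because it is followed by a digit rather than an uppercase letter.
strip_stray_dots("See p. 12 of the re.port.", abbreviations = c("p"))
# [1] "See p. 12 of the report."
```

Keeping the exceptions in a vector makes it easier to add domain-specific abbreviations without editing the regex by hand.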

Answer 3

    paste0(gsub('\\.', '', test), '.')
    #[1] "this is a test."

To make this work with more than one sentence (it gets ugly),

    paste(paste0(gsub('\\.', '', unlist(strsplit(test, '\\. '))), '.'), collapse = ' ')
    #[1] "this is a test. With another sentence."
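
The one-liner above handles a single string. If the input is a character vector, the same split / strip / re-join idea could be wrapped roughly as follows; strip_dots_per_sentence is a hypothetical name, and the sketch assumes every element ends with a dot and that sentences are separated by a dot plus a single space:

```r
# Hypothetical wrapper around the split / strip / re-join idea so that it
# works on a character vector instead of a single string.
strip_dots_per_sentence <- function(x) {
  vapply(strsplit(x, "\\. "), function(sentences) {
    # Drop every remaining dot, then give each sentence back its final dot.
    paste(paste0(gsub("\\.", "", sentences), "."), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

res <- strip_dots_per_sentence(c("t.h.i.s i.s a test.",
                                 "t.h.i.s i.s a test. With a.n.other sen.t.ence."))
# res is c("this is a test.", "this is a test. With another sentence.")
```

Note that a final dot is always appended, so an element that did not end with a dot gains one.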

3 answers

solution1
2 2016-04-15 13:04:04

solution2
2 2016-04-15 13:42:20

solution3
0 2016-04-15 12:58:41
