
R: How to replace a dot between two characters in a string

I am processing a lot of old text material. Quite often the OCR process puts a "." inside a word, for example "t.h.i.s i.s a test." I want to replace these dots with an empty string "". But I do not want to remove the dots that mark the end of a sentence. So I am looking for a regular expression that matches letter/dot/letter and then replaces the dot with nothing.

    test <- "t.h.i.s i.s a test." 
    gsub(test, pattern="\\w[[:punct:]]\\w", replacement="")

But this is the result

    ".  a test."

Any suggestions are appreciated.

You can do the opposite, i.e. extract every character that is not a dot, plus a dot at the very end of the string, and paste the pieces back together:

library(stringr)
test <- "t.h.i.s i.s a test." 
paste0(str_extract_all(test, "[^\\.]|(\\.$)")[[1]], collapse = "")

[1] "this is a test."

If you want to handle multiple sentences, and we can assume that a dot followed by a space marks the end of a sentence and should be kept, then you can use:

test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence." 
paste0(str_extract_all(test, "[^\\.]|(\\.$)|(\\. )")[[1]], collapse = "")

[1] "this is a test. With another sentence."
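
The same rule can also be phrased as a deletion rather than an extraction. The following is only a sketch in base R, assuming `perl = TRUE` is acceptable so that a negative lookahead can be used; the helper name `drop_inner_dots` is made up for illustration and is not part of the answer above.

```r
# Hypothetical helper: delete every dot unless it is followed by a space
# or by the end of the string (the same assumption as the pattern above).
drop_inner_dots <- function(x) {
  gsub("[.](?!$| )", "", x, perl = TRUE)
}

test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence."
drop_inner_dots(test)
#[1] "this is a test. With another sentence."
```

This avoids the stringr dependency, but it shares the limitation of the extraction version: any dot followed by a space is treated as a sentence end.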

Here is my best guess, and a suggestion on how to further enhance the pattern:

> test = "T.h.i.s is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."
> gsub("\\b((?:U[.]S|etc|M(?:r?s|r))[.]|\\d+[.]\\d+)|[.](?!$|\\s+\\p{Lu})", "\\1", test, perl = TRUE)
[1] "This is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."


Explanation:

  • \\b((?:U[.]S|etc|M(?:r?s|r))[.]|\\d+[.]\\d+) - match the exceptions, which are restored with the \\1 backreference in the replacement. This part matches U.S., etc., Mr., Ms., Mrs. and digits.digits, and can be extended
  • | - or
  • [.](?!$|\\s+\\p{Lu}) - match a dot that is not followed by the end of the string ($) or by 1+ whitespace characters and an uppercase letter (\\s+\\p{Lu})
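
One way the exception list might be extended is to build the pattern from a character vector instead of editing the regex by hand. This is a sketch only: the function name `strip_ocr_dots`, the `exceptions` argument and the extra `Dr` entry are assumptions added for illustration, and each entry must itself be a valid regex fragment.

```r
# Hypothetical helper: assemble the "keep" branch from a vector of
# abbreviations, then reuse the "drop" branch explained above.
strip_ocr_dots <- function(x, exceptions = c("U[.]S", "etc", "Mrs?", "Ms", "Dr")) {
  # Branch 1: abbreviations and decimal numbers, captured so they survive.
  keep <- paste0("\\b((?:", paste(exceptions, collapse = "|"), ")[.]|\\d+[.]\\d+)")
  # Branch 2: any other dot not at the end of the string or before
  # whitespace plus an uppercase letter; it is replaced by nothing.
  drop <- "[.](?!$|\\s+\\p{Lu})"
  gsub(paste0(keep, "|", drop), "\\1", x, perl = TRUE)
}

strip_ocr_dots("T.h.i.s is a U.S. state. Dr. Who drove 5.5 miles.")
#[1] "This is a U.S. state. Dr. Who drove 5.5 miles."
```

Keeping the abbreviations in a vector makes it easier to add entries as new false positives turn up in the OCR output.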
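
Another option is to remove every dot and then append a single dot at the end:

test <- "t.h.i.s i.s a test."
paste0(gsub('\\.', '', test), '.')
#[1] "this is a test."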

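To make this (admittedly ugly) approach work with more than one sentence, split the text on ". " first, strip the dots from each piece, and paste the pieces back together:

test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence."
paste(paste0(gsub('\\.', '', unlist(strsplit(test, '\\. '))), '.'), collapse = ' ')
#[1] "this is a test. With another sentence."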
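
For comparison, the letter/dot/letter rule from the question can be written directly with lookarounds, which check the neighbouring characters without consuming them. The attempt in the question deleted the letters along with the dot because `\\w` was part of the match. The following is a minimal sketch, assuming `perl = TRUE` and that only alphabetic neighbours count; the name `drop_dots_between_letters` is invented for illustration.

```r
# Hypothetical helper: remove a dot only when a letter sits directly on
# both sides of it. Lookbehind and lookahead leave the letters in place.
drop_dots_between_letters <- function(x) {
  gsub("(?<=[[:alpha:]])[.](?=[[:alpha:]])", "", x, perl = TRUE)
}

test <- "t.h.i.s i.s a test. With a.n.other sen.t.ence."
drop_dots_between_letters(test)
#[1] "this is a test. With another sentence."
```

Because the neighbours are not consumed, runs such as "t.h.i.s" are handled in one pass. A capture-group version like `(\\w)[.](\\w)` would skip every second dot in such a run, since each match uses up the letter that the next match needs.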
