I have a whole lot of OCR'ed text that has quite a lot of unwanted text in it. The problem at hand is to find words that have at least 3 characters but do NOT contain 3 or more sequential repetitions of the same character.
I have gotten as far as getting two different regular expressions to work for the two different rules, but I am not sure how to combine them.
This one matches words with 3 or more sequential repetitions (it will need to be negated when combined with the next one): (.*)\\1{2,}
This one matches words with 3 or more alpha characters: \\b[a-zA-Z]{3,}\\b
I now need to combine these two into one expression. Here are some examples:
Words I want to match
Words I DO NOT want to match
Any help will be appreciated.
Use negative lookahead for the detection of repeating characters. You know the rest of the solution already :-)
/\b(?![a-z]*?([a-z])\1{2})[a-z]{3,}\b/i
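To see how the lookahead behaves, here is a minimal Python sketch of the same pattern. The function name find_words and the sample sentence are made up for illustration, and the /i flag is assumed to map to re.IGNORECASE. Note that the pattern contains a capture group, so the sketch uses finditer and takes the whole match rather than relying on findall, which would return the group instead of the word.

```python
import re

# Same pattern as above: a word of 3+ letters that does not contain
# any letter repeated 3 or more times in a row.
WORD_RE = re.compile(r"\b(?![a-z]*?([a-z])\1{2})[a-z]{3,}\b", re.IGNORECASE)


def find_words(text):
    """Return the words in text that pass both rules (hypothetical helper)."""
    # group(0) is the whole match; the capture group only lives inside
    # the negative lookahead and is never set on a successful match.
    return [m.group(0) for m in WORD_RE.finditer(text)]


if __name__ == "__main__":
    sample = "The cat sat on aaa mat with booook and hello ab"
    print(find_words(sample))
    # prints ['The', 'cat', 'sat', 'mat', 'with', 'and', 'hello']
```

Here "on" and "ab" are dropped for being too short, while "aaa" and "booook" are dropped because the lookahead finds a letter repeated three times.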