
Regex find words that have at least X characters with no more than Y sequential repetitions of a character

I have a large amount of OCR'ed text that contains quite a lot of unwanted text. The problem at hand is to find words that have at least 3 characters but do NOT contain 3 or more sequential repetitions of the same character.

I have gotten as far as getting two different regular expressions to work for the two rules, but I am not sure how to combine them.

This one matches words with 3 sequential repetitions (it will need to be negated when combined with the next one): (.*)\\1{2,}

This one matches words with 3 or more alphabetic characters: \\b[a-zA-Z]{3,}\\b

I now need to combine these two into one expression. Here are some examples:

Words I want to match

  • Jack
  • Slack
  • Traack
  • Maacka

Words I DO NOT want to match

  • Jac (Not long enough)
  • Slaaack (Has 3 SEQUENTIAL repetitions of "A")

Any help will be appreciated.

Use negative lookahead for the detection of repeating characters. You know the rest of the solution already :-)

/\b(?![a-z]*?([a-z])\1{2})[a-z]{3,}\b/i
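
To see how the lookahead behaves on the words from the question, here is a minimal sketch in Python (the question does not say which regex flavor is in use, so Python's re module is an assumption). The helper names build_pattern and find_words, and the parameters min_len and max_run, are made up for illustration; the pattern itself is the one above with the two numbers made configurable.

```python
import re

def build_pattern(min_len=3, max_run=2):
    """Build the lookahead pattern from the answer.

    min_len: minimum word length (hypothetical parameter name).
    max_run: longest allowed run of one repeated letter (hypothetical name).
    """
    # ([a-z])\1{max_run} means one letter followed by max_run copies of it,
    # i.e. a run of max_run + 1 identical letters; the negative lookahead
    # rejects any word that contains such a run.
    pattern = r"\b(?![a-z]*?([a-z])\1{%d})[a-z]{%d,}\b" % (max_run, min_len)
    return re.compile(pattern, re.IGNORECASE)

def find_words(text, min_len=3, max_run=2):
    # finditer is used instead of findall, because findall would return
    # only the contents of the capture group rather than the whole word.
    regex = build_pattern(min_len, max_run)
    return [m.group(0) for m in regex.finditer(text)]

sample = "Jack Slack Traack Maacka Jac Slaaack"
print(find_words(sample))             # {3,} as written in the answer
print(find_words(sample, min_len=4))  # stricter minimum length
```

With {3,} as written, the three-letter word Jac is still matched; only Slaaack is rejected. To exclude Jac as in the question's examples, the minimum length would have to be raised to 4 (i.e. {4,}).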

