Regex that matches 4 or more words from a list

Question

Background

We have a system that maintains a repository of regular expressions and checks incoming text against them for filtering purposes. One of the regexes we are trying to build is described below. The solution has to be strictly regex-based due to production constraints.

Problem

I have a list of words: word1, word2, word3, word4, word5, word6, word7, word8, word9, word10. I am trying to write a regular expression that matches a string if it contains 4 or more of these words, at any position and in any order.

Examples

"Abc word3 def word2 ghi word7 jkl word1 mno word5" should be a match, since it contains 5 words from the list, which meets the threshold of 4.
"Abc word2 def ghi word8" shouldn't be a match, since it contains only 2 words from the list.

Current State

I have the following regex, but it doesn't seem to do what I need.

((?i)((word1)|(word2)|(word3)|(word4)|(word5)|(word6)|(word7)|(word8)|(word9)|(word10))\b){4,}

Any suggestions please, in either Java or Python notation?

Edit: Added some background information.

Answer 1

The following regex worked for all my tests:

(?i)(.*(^|\b)((word1)|(word2)|(word3)|(word4)|(word5)|(word6)|(word7)|(word8)|(word9)|(word10))($|\b).*){4,}

The tests were:

"Abc word3 def word2 ghi word7 jkl word1 mno word5" -> true
"Abc word2 def ghi word8" -> false
"word3 sadasd sadasd word1 word2 word4" -> true
"word3 sadasd sadasd word1 word2word4" -> false
"aword3 sadasd sadasd word1 word2 word4" -> false
"word3 sadasd sadasd word1 word2 word4a" -> false

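I think your original regex was mainly missing the .* needed to match arbitrary text before and after each keyword. Without it, the repeated group can only match keywords that directly follow one another with nothing in between.

I also took care to check for the beginning of the line or a word boundary before each keyword (test 5), which I think was also missing.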
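For anyone who wants to re-run the tests above, here is a minimal Python sketch. It assumes Python's re module with re.search; the names WORDS, PATTERN, has_four_keywords and CASES are my own and do not come from the answer. The pattern is assembled from the word list but has the same shape as the regex above.

```python
import re

# Keyword list from the question.
WORDS = [f"word{i}" for i in range(1, 11)]

# Rebuild the regex from the answer: (?i)(.*(^|\b)((word1)|...|(word10))($|\b).*){4,}
alternation = "|".join(f"({w})" for w in WORDS)
PATTERN = re.compile(rf"(?i)(.*(^|\b)({alternation})($|\b).*){{4,}}")


def has_four_keywords(text):
    """Return True if the pattern finds at least 4 whole-word keywords in text."""
    return PATTERN.search(text) is not None


# The six test strings from the answer, with the results it reports.
CASES = {
    "Abc word3 def word2 ghi word7 jkl word1 mno word5": True,
    "Abc word2 def ghi word8": False,
    "word3 sadasd sadasd word1 word2 word4": True,
    "word3 sadasd sadasd word1 word2word4": False,
    "aword3 sadasd sadasd word1 word2 word4": False,
    "word3 sadasd sadasd word1 word2 word4a": False,
}

for text, expected in CASES.items():
    result = has_four_keywords(text)
    print(f"{text!r} -> {result} (expected {expected})")
```

Note that the pattern counts occurrences rather than distinct words, so "word1 word1 word1 word1" also matches. The nested .* can backtrack heavily on long non-matching input, so it may be worth timing the regex against realistic text before relying on it.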

Answer 2

You don't need a regular expression. If all you care about is how many of the listed words appear in the text, you can convert the word list into a set and intersect it with the words of the input string.

wrd_list = ["word1", "word2", "word3", "word4", "word5", "word6", "word7", "word8", "word9", "word10"]

s = "Abc word3 def word2 ghi word7 jkl word1 mno word5"

# "4 or more" means the comparison must be >=, not >
if len(set(wrd_list).intersection(s.split())) >= 4:
    print('4 or more words from the list found')

EDIT: This code is in Python
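One caveat: s.split() only splits on whitespace, so a token such as "word1," or "WORD2" would not be counted. A possible variant is sketched below. It assumes that matching should ignore case and punctuation, which the question does not state explicitly, and the names WORDS and count_list_words are hypothetical.

```python
import re

# Keyword list from the question, stored as a set for fast intersection.
WORDS = {f"word{i}" for i in range(1, 11)}


def count_list_words(text):
    """Return how many distinct list words occur in text as whole tokens."""
    # \w+ pulls out runs of word characters, so surrounding punctuation is dropped.
    tokens = {token.lower() for token in re.findall(r"\w+", text)}
    return len(WORDS & tokens)


print(count_list_words("Abc word3, def WORD2; ghi word7. jkl word1!") >= 4)
print(count_list_words("Abc word2 def ghi word8") >= 4)
```

Unlike the regex in Answer 1, this counts distinct words, so a single word repeated four times counts only once.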

Answer 3

Perhaps this (not regex, but I think more readable):

words = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']
text = "Abc word2 def ghi word8"
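# Test against whole tokens: a plain substring check would count "word1" inside "word10"
tokens = set(text.split())
print(sum(w in tokens for w in words) >= 4)  # False: only 2 list words are present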

3 answers

solution1
1 ACCEPTED 2020-02-11 02:35:09

solution2
1 2020-02-11 02:45:36

solution3
0
