
Regex that matches 4 or more words from a list

Background

We have a system that maintains a repository of regular expressions and checks incoming text against them for filtering purposes. One of the regexes we are trying to build is described below. Because of production constraints, I am looking for a strictly regex-based solution.

Problem

I have a list of words: word1, word2, word3, word4, word5, word6, word7, word8, word9, word10. I am trying to write a regular expression that matches a string if it contains 4 or more of these words, at any positions in any order.

Examples

  • "Abc word3 def word2 ghi word7 jkl word1 mno word5" should be a match, since it has 5 words from the given list (at least 4).
  • "Abc word2 def ghi word8" shouldn't be a match, since it has only 2 words from the given list.

Current State

I have the following regex, but it doesn't seem to do what I need.

((?i)((word1)|(word2)|(word3)|(word4)|(word5)|(word6)|(word7)|(word8)|(word9)|(word10))\b){4,}

Any suggestions please, in either Java or Python notation?

Edit: Added some background information.

The following regex worked for all my tests:

(?i)(.*(^|\b)((word1)|(word2)|(word3)|(word4)|(word5)|(word6)|(word7)|(word8)|(word9)|(word10))($|\b).*){4,}

They include:

  1. "Abc word3 def word2 ghi word7 jkl word1 mno word5" -> true
  2. "Abc word2 def ghi word8" -> false
  3. "word3 sadasd sadasd word1 word2 word4" -> true
  4. "word3 sadasd sadasd word1 word2word4" -> false
  5. "aword3 sadasd sadasd word1 word2 word4" -> false
  6. "word3 sadasd sadasd word1 word2 word4a" -> false

I think your original regex was mainly missing the .* needed to match arbitrary text before and after each keyword; without it, the four keywords would have to follow one another directly.

I also check for either the beginning of the line or a word boundary before each keyword (test 5), which I think was missing as well.
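
The six tests above can be reproduced with a small Python harness. This is only a sketch, not part of the original answer: the names ORIGINAL, ANSWER, CASES and is_match are made up for illustration, and the question's inline (?i) is replaced by re.IGNORECASE because recent Python versions only accept a global inline flag at the very start of a pattern.

```python
import re

# Pattern from the question; the inline (?i) is replaced by re.IGNORECASE
# (assumption: this keeps the intended case-insensitive behaviour).
ORIGINAL = re.compile(
    r"(((word1)|(word2)|(word3)|(word4)|(word5)|(word6)|(word7)|(word8)|(word9)|(word10))\b){4,}",
    re.IGNORECASE,
)

# Pattern from the answer, copied verbatim.
ANSWER = re.compile(
    r"(?i)(.*(^|\b)((word1)|(word2)|(word3)|(word4)|(word5)|(word6)|(word7)|(word8)|(word9)|(word10))($|\b).*){4,}"
)

# The six test strings listed in the answer, with their expected results.
CASES = [
    ("Abc word3 def word2 ghi word7 jkl word1 mno word5", True),
    ("Abc word2 def ghi word8", False),
    ("word3 sadasd sadasd word1 word2 word4", True),
    ("word3 sadasd sadasd word1 word2word4", False),
    ("aword3 sadasd sadasd word1 word2 word4", False),
    ("word3 sadasd sadasd word1 word2 word4a", False),
]


def is_match(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


for text, expected in CASES:
    print(f"{text!r}: original={is_match(ORIGINAL, text)}, "
          f"answer={is_match(ANSWER, text)}, expected={expected}")
```

Two things are worth noting when reading the output. The question's pattern cannot match any input: each keyword must be followed by a word boundary, yet the next repetition must start with another keyword immediately, and both cannot hold at once. The answer's pattern counts occurrences rather than distinct words, so a string such as "word1 word1 word1 word1" also matches.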

You don't need a regular expression. If all you care about is how many of the listed words occur, you can convert the word list into a set and intersect it with the words of the input text.

wrd_list = ["word1", "word2", "word3", "word4", "word5", "word6", "word7", "word8", "word9", "word10"]

s = "Abc word3 def word2 ghi word7 jkl word1 mno word5"

# "4 or more" means >= 4; note that split() only separates on whitespace
if len(set(wrd_list).intersection(s.split())) >= 4:
    print('4 or more listed words found')

EDIT: This code is in Python
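
The set intersection counts each listed word once, whereas a repeated-group regex counts every occurrence. If distinct words are what is wanted and the solution has to stay a regex, one possible sketch uses a negative lookahead with a backreference so that each repetition can only land on the last occurrence of a word. The names WORDS, DISTINCT4 and has_four_distinct are hypothetical, and the sketch assumes single-line input, since . does not match newlines by default.

```python
import re

# Placeholder word list from the question.
WORDS = [f"word{i}" for i in range(1, 11)]

alternation = "|".join(map(re.escape, WORDS))

# Each of the 4 repetitions skips ahead lazily to a listed word that does NOT
# occur again later in the text: (?!.*\b\1\b) rejects a word that reappears.
# Only the last occurrence of any word can pass, so 4 repetitions need
# 4 different words.
DISTINCT4 = re.compile(
    rf"^(?:.*?\b({alternation})\b(?!.*\b\1\b)){{4}}",
    re.IGNORECASE,
)


def has_four_distinct(text):
    """Return True if at least 4 different listed words occur as whole words."""
    return DISTINCT4.search(text) is not None


print(has_four_distinct("Abc word3 def word2 ghi word7 jkl word1 mno word5"))
print(has_four_distinct("word1 word1 word1 word1"))
```

The same pattern text should carry over to Java's java.util.regex with (?i) placed at the front, although that has not been tested here; with matches() a trailing .* would be needed because matches() requires the whole input to match.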

Perhaps this (not regex, but I think more readable):

words = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']
text = "Abc word2 def ghi word8"
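tokens = set(text.split())  # compare whole tokens so 'word1' is not counted inside 'word10'
print(sum(w in tokens for w in words) >= 4)  # True only when 4 or more listed words occur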
