
Building a regular expression to find text near each other

I'm having trouble getting this search to work:

import re

word1 = 'this'
word2 = 'that'
sentence = 'this and that'

print(re.search('(?:\b(word1)\b(?: +[^ \n]*){0,5} *\b(word2)\b)|(?:\b(word2)\b(?: +[^ \n]*){0,5} *\b(word1)\b)',sentence))

I need to build a regex search to find if a string has up to 5 different sub-strings in any order within a certain number of other words (so two strings could be 3 words apart, three strings a total of 6 words apart, etc).

I've found a number of similar questions, such as "Regular expression gets 3 words near each other. How to get their context?" or "How to check if two words are next to each other in Python?", but none of them quite do this.

So if the search words were 'this', 'that', 'these', and 'those' and they appeared within 9 words of each other in any order, then the script would output True.

It seems like writing an if/else block with all sorts of different regex statements to accommodate the different permutations would be rather cumbersome, so I'm hoping there is a more efficient way to code this in Python.

This can be done in engines that support conditionals, atomic groups, and a
capture-group status that distinguishes a group flagged EMPTY (it matched the empty string) from one that is NULL (undefined, it never matched).

That covers almost all modern engines, though some, such as JavaScript's, are incomplete.
Python can do this with its replacement engine, the regex module (import regex).

Basically, this pattern matches the target words in any order and confines the
match to the shortest range of 4 to 9 total words.
The (?= \1 \2 \3 \4 ) at the bottom asserts that all the required items were found.
Without the atomic group this might cause backtracking problems, but with it
in place this regex is very fast.

Update: added the lookahead (?= this | that | these | those ) so the match starts on one of the target words.
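
The "empty group as a flag" idea can be tried in isolation. The following is only a reduced sketch, not the full pattern from this answer: it uses the standard re module, tracks just two words, and leaves out the conditionals, the atomic group and the word-count limit. The name has_both is made up for the illustration. Each keyword alternative ends in an empty group (), and the trailing \1\2 can only succeed if both groups were set at some point during the scan.

```python
import re

# Reduced illustration (assumption: two keywords, no distance limit).
# Each keyword alternative ends with an empty capture group that acts as a flag.
# A backreference to a group that never participated fails in Python,
# so the trailing \1\2 asserts that both flags were set.
BOTH = re.compile(r"^(?:\bthis\b()|\bthat\b()|.)*\1\2$", re.DOTALL)

def has_both(text):
    """Return True if both 'this' and 'that' occur as whole words (hypothetical helper)."""
    return BOTH.match(text) is not None

for sample in ["say this and that", "that then this", "this and this", "thistle and that"]:
    print(has_both(sample), ":", sample)
```

The full pattern below adds (?(n)(?!)) in front of each alternative so that a flag cannot be set twice, and bounds the number of repetitions to limit the distance.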

Python code

>>> import regex
>>>
>>> targ = 'this  sdgbsesfrgnh these meat ball those  nhwsgfr that sfdng  sfgnsefn sfgnndfsng'
>>> pat = r'(?=this|that|these|those)(?>\s*(?:(?(1)(?!))\bthis\b()|(?(2)(?!))\bthat\b()|(?(3)(?!))\bthese\b()|(?(4)(?!))\bthose\b()|(?(5)(?!))\b(.+?)\b|(?(6)(?!))\b(.+?)\b|(?(7)(?!))\b(.+?)\b|(?(8)(?!))\b(.+?)\b|(?(9)(?!))\b(.+?)\b)\s*){4,9}?(?=\1\2\3\4)'
>>>
>>> regex.search(pat, targ).group()
'this  sdgbsesfrgnh these meat ball those  nhwsgfr that '

General PCRE / Perl et al. (same regex)

(?=this|that|these|those)(?>\s*(?:(?(1)(?!))\bthis\b()|(?(2)(?!))\bthat\b()|(?(3)(?!))\bthese\b()|(?(4)(?!))\bthose\b()|(?(5)(?!))\b(.+?)\b|(?(6)(?!))\b(.+?)\b|(?(7)(?!))\b(.+?)\b|(?(8)(?!))\b(.+?)\b|(?(9)(?!))\b(.+?)\b)\s*){4,9}?(?=\1\2\3\4)

https://regex101.com/r/zhSa64/1

 (?= this | that | these | those )
 (?>
    \s* 
    (?:
       (?(1)(?!))
       \b this \b ( )                # (1)
     |
       (?(2)(?!))
       \b that \b ( )                # (2)
     | 
       (?(3)(?!))
       \b these \b ( )               # (3)
     | 
       (?(4)(?!))
       \b those \b ( )               # (4)
     |
       (?(5)(?!))
       \b ( .+? ) \b                 # (5)
     |
       (?(6)(?!))
       \b ( .+? ) \b                 # (6)
     |
       (?(7)(?!))
       \b ( .+? ) \b                 # (7)
     |
       (?(8)(?!))
       \b ( .+? ) \b                 # (8)
     |
       (?(9)(?!))
       \b ( .+? ) \b                 # (9)
    )
    \s* 
 ){4,9}?
 (?= \1 \2 \3 \4 )

ANSWER CHANGED because I found a way to do it with just a regular expression. The approach is to start with a lookahead that requires all target words to be present in the next N words, then look for a pattern of target words (in any order) separated by 0 or more other words (up to the allowed maximum number of intermediate words).

The word span (N) is the greatest number of words that would allow all the target words to be at the maximum allowed distance.

For example, with 3 target words and a maximum of 4 other words between them, the maximum word span is 11: 3 target words plus 2 intermediate runs of at most 4 other words each, i.e. 3+4+4=11.
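
As a quick check of that arithmetic, here is a minimal sketch of the span calculation. The function name word_span is only illustrative; the formula is the same one used for wordSpan in the code further down.

```python
def word_span(target_count, max_inter):
    """Longest window, in words, that can hold every target word.

    target_count target words, with at most max_inter other words
    in each of the (target_count - 1) gaps between them.
    """
    return target_count + max_inter * (target_count - 1)

print(word_span(3, 4))  # 3 + 4 + 4 = 11
```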

The search pattern is formed by assembling parts that depend on the words and the maximum number of intermediate words allowed.

Pattern: \bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}

breakdown:

  • \b start on a word boundary
  • ALL will be substituted by multiple lookaheads that will ensure that every target word is found in the next N words.
  • each lookahead will have the form (?=(\w+\W*){0,SPAN}WORD\b), where WORD is a target word and SPAN is N-1, the largest number of words that can precede WORD in a window of N words. There is one such lookahead per target word, which ensures that the sequence of N words contains all of the target words.
  • (\b(ANY)(\W+\w+\W*){0,INTER}) matches any target word followed by zero to maxInter intermediate words. Here ANY is replaced by a pattern that matches any of the target words (i.e. the words separated by pipes), and INTER is replaced by the allowed number of intermediate words.
  • {COUNT,COUNT} ensures that there are as many repetitions of the above as there are target words. This corresponds to the pattern: targetWord+intermediates+targetWord+intermediates...+targetWord
  • With the lookaheads placed before the repeating pattern, we are guaranteed to have all the target words in a sequence of words containing exactly COUNT target words, with no more intermediate words between them than allowed.
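
The breakdown above can be sketched as a single helper. This is one possible way to assemble the same pieces, not the exact code of this answer: the function name build_proximity_pattern is hypothetical, the words are sorted so the resulting pattern is reproducible, re.escape is applied in case a target word contains regex metacharacters, the groups are non-capturing, and a \b is added after the alternation so that a target word cannot match as a prefix of a longer word.

```python
import re

def build_proximity_pattern(words, max_inter):
    """Compile a pattern following the ALL / ANY / INTER / COUNT scheme (illustrative helper)."""
    targets = sorted(set(words))              # fixed order keeps the pattern reproducible
    count = len(targets)
    span = count + max_inter * (count - 1)    # longest window, in words
    # ALL: one lookahead per target word; the word must appear within the window.
    all_words = "".join(
        rf"(?=(?:\w+\W*){{0,{span - 1}}}{re.escape(w)}\b)" for w in targets
    )
    # ANY: alternation of the target words.
    any_word = "|".join(re.escape(w) for w in targets)
    # A target word followed by 0..max_inter other words, repeated COUNT times.
    body = rf"(?:\b(?:{any_word})\b(?:\W+\w+\W*){{0,{max_inter}}}){{{count}}}"
    return re.compile(r"\b" + all_words + body)

pattern = build_proximity_pattern({"this", "that", "other"}, 3)
print(bool(pattern.search("looking for this and that and some other thing")))
print(bool(pattern.search("that rod is longer than this other one")))
```

Note that the repetition only counts target words; it is the lookaheads that require each distinct word to be present, so inputs with repeated target words deserve a separate check before relying on this.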

...

import re

words    = {"this","that","other"}
maxInter = 3 # maximum intermediate words between the target words

wordSpan = len(words)+maxInter*(len(words)-1)

anyWord  = "|".join(words)
allWords = "".join(r"(?=(\w+\W*){0,SPAN}WORD\b)".replace("WORD",w) 
                    for w in words)
allWords = allWords.replace("SPAN",str(wordSpan-1))
                    
pattern = r"\bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}"
pattern = pattern.replace("COUNT",str(len(words)))
pattern = pattern.replace("INTER",str(maxInter))
pattern = pattern.replace("ALL",allWords)
pattern = pattern.replace("ANY",anyWord)


textList = [
   "looking for this and that and some other thing", # YES
   "that rod is longer than this other one",         # NO: 4 words apart
   "other than this, I have nothing",                # NO: missing "that"
   "ignore multiple words not before this and that or other", # YES
   "this and that or other, followed by a bunch of words",    # YES
           ] 

output:

print(pattern)

\b(?=(\w+\W*){0,8}this\b)(?=(\w+\W*){0,8}other\b)(?=(\w+\W*){0,8}that\b)(\b(this|other|that)(\W+\w+\W*){0,3}){3,3}

for text in textList:
    found = bool(re.search(pattern,text))
    print(found,"\t:",text)

True    : looking for this and that and some other thing
False   : that rod is longer than this other one
False   : other than this, I have nothing
True    : ignore multiple words not before this and that or other
True    : this and that or other, followed by a bunch of words
