
Regex match word if optionally followed by any word unless followed by certain words

I'm trying to match all of the following cases except case 6 and case 8:

case 1 -    deliverto               should match
case 2 -    deliveryto :            should match
case 3 -    deliveryto:             should match
case 4 -    delivery to :           should match
case 5 -    delivery address :      should match
case 6 -    delivery order :        should NOT match
case 7 -    ship to:                should match
case 8 -    delivery inst :         should NOT match
case 9 -    delivery                should match
case 10 -   remit to :              should match
case 11 -   send to:                should match
case 12 -   remitto:                should match
case 13 -   delivery:               should match
case 14 -   deliver:                should match
case 15 -   delv. :                 should match

My logic is: match the 1st chunk (ship, send, remit, deliver, delivery, or delv with an optional dot) whether or not the 2nd chunk (to or address) follows it, but do not match the 1st chunk at all if the 3rd chunk (order or inst) comes right after it.

I've used a negative lookahead for the 3rd chunk, followed by an optional positive lookahead for the 2nd chunk. Here is the regex I have been trying:

pattern = r"(send|remit|ship|delivery|deliver|delv\.?)\s?(?!(Order|inst))(?=(to|address)?)\:?"

The first problem is that the regex still matches when the 1st chunk is followed by the 3rd chunk.

The second problem is that when the cases are in a list and I run re.finditer() on each of them, the optional 2nd chunk is not included in the match:

import re

l = ['case 1 -     deliverto', 'case 2 -     deliveryto :', 'case 3 -     deliveryto:     ', 'case 4 -     delivery to :', 'case 5 -     delivery address :', 'case 6 -     delivery order :', 'case 7 -     ship to:', 'case 8 -     delivery inst :', 'case 9 -     delivery  ', 'case 10 -     remit to :', 'case 11 -     send to:', 'case 12 -     remitto:', 'case 13 -     delivery: ', 'case 14 -     deliver: ', 'case 15 -     delv. :']
for case in l:
    print([m.group() for m in re.finditer(pattern, case, re.IGNORECASE)])

gives :

['deliver']
['delivery']
['delivery']
['delivery ']
['delivery ']
['delivery']
['ship ']
['delivery']
['delivery ']
['remit ']
['send ']
['remit']
['delivery:']
['deliver:']
['delv. :']

I need to match with the optional to or address chunk if found. What am I doing wrong in the regex?

For the full details, have a look at this regex101 demo. Thanks.

You need to fail the match whenever the first word is followed by order or inst, so put the negative lookahead before the first word:

(?i)\b(?!\S+\s+(?:order|inst))(?:send|remit|ship|delivery?|delv\.?)(?:\s*(?:to|address))?\s?:?

See the regex demo
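As background, here is a small sketch of why the pattern from the question behaves the way it does. It assumes the question's pattern was meant to use single backslashes; the names ORIGINAL and first_match are only illustrative. Because \s? is optional, the engine can backtrack and give the space back, so the negative lookahead ends up testing the space rather than the word order and lets the match through. The (?=(to|address)?) part is a lookahead too, so it never consumes to or address.

```python
import re

# The question's pattern, assumed to use single backslashes.
ORIGINAL = re.compile(
    r"(send|remit|ship|delivery|deliver|delv\.?)\s?(?!(Order|inst))(?=(to|address)?)\:?",
    re.IGNORECASE,
)


def first_match(text):
    """Return the text of the first match, or None if nothing matches."""
    m = ORIGINAL.search(text)
    return m.group() if m else None


# \s? backtracks to match nothing, so (?!(Order|inst)) is tested at the
# space before "order" and succeeds: the word still matches.
print(repr(first_match("delivery order :")))  # 'delivery'

# (?=(to|address)?) is zero-width, so "to" is checked but never consumed.
print(repr(first_match("delivery to :")))  # 'delivery '
```

Both results agree with the output listed in the question.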

Details:

  • (?i) - case insensitive matching ON (same as re.I )
  • \b - word boundary
  • (?!\S+\s+(?:order|inst)) - fail the match if 1+ non-whitespace chars, 1+ whitespace chars and then order or inst appear immediately on the right
  • (?:send|remit|ship|delivery?|delv\.?) - send , remit , ship , deliver or delivery , delv or delv.
  • (?:\s*(?:to|address))? - an optional sequence of 0+ whitespaces and then to or address
  • \s? - an optional whitespace
  • :? - an optional colon.
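A minimal sketch of running the regex above against the list from the question. The names PATTERN and find_keyword are only illustrative, and re.search is used here on the assumption that one match per string is enough.

```python
import re

# The regex from the answer, split across lines for readability.
PATTERN = re.compile(
    r"(?i)\b(?!\S+\s+(?:order|inst))"
    r"(?:send|remit|ship|delivery?|delv\.?)"
    r"(?:\s*(?:to|address))?\s?:?"
)

cases = [
    'case 1 -     deliverto', 'case 2 -     deliveryto :',
    'case 3 -     deliveryto:     ', 'case 4 -     delivery to :',
    'case 5 -     delivery address :', 'case 6 -     delivery order :',
    'case 7 -     ship to:', 'case 8 -     delivery inst :',
    'case 9 -     delivery  ', 'case 10 -     remit to :',
    'case 11 -     send to:', 'case 12 -     remitto:',
    'case 13 -     delivery: ', 'case 14 -     deliver: ',
    'case 15 -     delv. :',
]


def find_keyword(text):
    """Return the matched keyword text, or None when the line is rejected."""
    m = PATTERN.search(text)
    return m.group() if m else None


for case in cases:
    # Cases 6 and 8 print None; the others print the matched text.
    print(repr(find_keyword(case)))
```

With this pattern the optional to or address is part of the match (for example 'delivery to :' for case 4), while cases 6 and 8 produce no match at all.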
