
Regex optional matches pattern

I'm trying to use the ? quantifier to match a pattern only if it exists, but I can't get it to work the way I want. In the example below I'm trying to extract the pairs of digits that follow AZA: and ZZZ:, where ZZZ is always present but AZA is optional. When AZA is missing, I just want to return a ('', [zzz-value]) pair (an empty string instead of the AZA value):

Input:

AZA:00zx---
ZZZ:32fd---
testxfiler
gsdkfklsd
fdsfsk
AZA:06x---
ZZZ:50----
gsdkfklsd
gsdkfklsd
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
ZZZ:32zzz----
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
AZA:46----
ZZZ:53---

Desired output:

[(00,32), (06, 50), ('',32), (46,53)]

My attempt:

re.findall('(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', text, re.DOTALL)

My output:

[('00', '32'), ('', '50'), ('', '32'), ('', '53')]

(?:AZA:(\d+).*?)?ZZZ:(\d+)


import re

# A plain raw string works on Python 3; the ur'' prefix is valid on Python 2 only.
p = re.compile(r'(?:AZA:(\d+).*?)?ZZZ:(\d+)', re.DOTALL)
test_str = "AZA:00zx---\nZZZ:32fd---\ntestxfiler\ngsdkfklsd\nfdsfsk\nAZA:06x---\nZZZ:50----\ngsdkfklsd\ngsdkfklsd\nfdsfsk\nfdsfsk\ngsdkfklsd\nfdsfsk\nZZZ:32zzz----\nfdsfsk\nfdsfsk\ngsdkfklsd\nfdsfsk\nAZA:46----\nZZZ:53---"

p.findall(test_str)

You don't need the DOTALL modifier:

>>> text = """AZA:00zx---
ZZZ:32fd---
testxfiler
gsdkfklsd
fdsfsk
AZA:06x---
ZZZ:50----
gsdkfklsd
gsdkfklsd
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
ZZZ:32zzz----
fdsfsk
fdsfsk
gsdkfklsd
fdsfsk
AZA:46----
ZZZ:53---"""
>>> re.findall(r'(?:AZA:([0-9]+)[\S\s]*?)?ZZZ:([0-9]+)', text)
[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]


[\S\s]* matches any character, whitespace or non-whitespace (so line breaks too), zero or more times.
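To make the point about [\S\s] concrete, here is a minimal sketch comparing it with a plain dot, with and without DOTALL. The shortened sample string and the variable names are illustrative and not part of the original post:

```python
import re

# Shortened, illustrative input: one AZA/ZZZ pair, then a ZZZ line on its own.
sample = "AZA:00zx---\nZZZ:32fd---\nfiller\nZZZ:50----"

# [\S\s] crosses line breaks without any flag.
class_based = re.findall(r'(?:AZA:([0-9]+)[\S\s]*?)?ZZZ:([0-9]+)', sample)

# A dot crosses line breaks only when re.DOTALL is set.
dotall_based = re.findall(r'(?:AZA:([0-9]+).*?)?ZZZ:([0-9]+)', sample, re.DOTALL)

# A dot without DOTALL stops at the end of the AZA line, so the group never reaches ZZZ.
plain_dot = re.findall(r'(?:AZA:([0-9]+).*?)?ZZZ:([0-9]+)', sample)

print(class_based)   # [('00', '32'), ('', '50')]
print(dotall_based)  # [('00', '32'), ('', '50')]
print(plain_dot)     # [('', '32'), ('', '50')]
```

The first two calls behave the same, so choosing between [\S\s] and DOTALL is largely a matter of taste here.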

Why does your regex fail?

(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)

In DOTALL mode, the dot also matches line breaks. In your regex the group (?:AZA:([0-9]*))? is optional and the .*? that follows sits outside it. Whenever a match attempt starts at a position that is not AZA:, the engine skips the group and lets .*? consume everything up to the next ZZZ:, including any AZA: line in between, so its digits are never captured. Moving .*? inside the optional group fixes this: the engine can skip ahead only after AZA:(\d+) has matched at the current position, and otherwise the match has to start at ZZZ: itself. The pattern no longer swallows an AZA: line unnecessarily.
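The difference is easy to check side by side. The following is a minimal sketch on a shortened copy of the input; the sample string and the names ORIGINAL, FIXED, original_pairs and fixed_pairs are illustrative, not from the post:

```python
import re

# Shortened, illustrative copy of the question's input.
sample = (
    "AZA:00zx---\nZZZ:32fd---\nfiller\n"
    "AZA:06x---\nZZZ:50----\nfiller\n"
    "ZZZ:32zzz----\nfiller\n"
    "AZA:46----\nZZZ:53---"
)

ORIGINAL = r'(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)'  # .*? outside the optional group
FIXED = r'(?:AZA:([0-9]*).*?)?ZZZ:([0-9]*)'     # .*? inside the optional group

original_pairs = re.findall(ORIGINAL, sample, re.DOTALL)
fixed_pairs = re.findall(FIXED, sample, re.DOTALL)

print(original_pairs)  # [('00', '32'), ('', '50'), ('', '32'), ('', '53')]
print(fixed_pairs)     # [('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
```

In the ORIGINAL run, the second match begins directly after the first one ends, at a position where AZA: is not present, so the group is skipped and .*? runs over the AZA:06 line.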

A regex of the form

(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]*

would be helpful.

For example

>>> re.findall(r'(?:AZA:(\d+)[^\n]*\n)?(?:ZZZ:)(\d+)[^\n]*', x)  # x holds the input text
[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
  • (?:AZA:(\d+)[^\n]*\n)? matches AZA: followed by digits ( \d+ ), then anything other than a line break ( [^\n]* ), then the line break itself. The ? quantifier at the end makes the entire group optional. The digits are captured in group 1.

  • (?:ZZZ:)(\d+)[^\n]* matches ZZZ: followed by digits ( \d+ ) and anything other than a line break. The digits are captured in group 2.
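The same pattern can be laid out with re.VERBOSE so that each piece described above carries its own comment. This is only a sketch: the named groups aza and zzz, the name PAIR, and the shortened sample are additions for illustration, not part of the original answer:

```python
import re

# Same structure as the pattern above, spread out with re.VERBOSE.
PAIR = re.compile(r"""
    (?: AZA: (?P<aza>\d+)   # optional AZA line: capture its leading digits
        [^\n]* \n           # rest of that line, then the line break
    )?
    ZZZ: (?P<zzz>\d+)       # mandatory ZZZ line: capture its leading digits
    [^\n]*                  # rest of the ZZZ line
""", re.VERBOSE)

# Shortened, illustrative input.
x = "AZA:00zx---\nZZZ:32fd---\nfiller\nZZZ:32zzz----\nAZA:46----\nZZZ:53---"

# An unmatched optional group is None in a match object, so map it to ''.
pairs = [(m.group('aza') or '', m.group('zzz')) for m in PAIR.finditer(x)]
print(pairs)  # [('00', '32'), ('', '32'), ('46', '53')]
```

Because [^\n]* cannot cross a line break, the AZA value is picked up only when the AZA line sits directly above the ZZZ line.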

What you missed

re.findall('(?:AZA:([0-9]*))?.*?ZZZ:([0-9]*)', text, re.DOTALL)

The .*? should have been placed inside the optional group, so that the whole of

(?:AZA:([0-9]*).*?)?

is optional, and the group should be followed by \n.

Changing your regex to

re.findall(r'(?:AZA:([0-9]*).*?)?\nZZZ:([0-9]*)', x)

gives:

[('00', '32'), ('06', '50'), ('', '32'), ('46', '53')]
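One edge case may be worth keeping in mind with this variant: the literal \n means a ZZZ line is found only if a line break precedes it, so a ZZZ on the very first line of the input would be skipped. The sketch below shows this on a hypothetical input, together with one possible alternative that uses ^ with re.MULTILINE. Both the input and the alternative pattern are assumptions added for illustration, not part of the original answer:

```python
import re

# Hypothetical input whose very first line is a ZZZ line.
sample = "ZZZ:10--\nfoo\nAZA:06x---\nZZZ:50----"

# The variant above: ZZZ must be preceded by a line break.
newline_based = re.findall(r'(?:AZA:([0-9]*).*?)?\nZZZ:([0-9]*)', sample)

# Possible alternative: anchor both lines with ^ in MULTILINE mode.
anchor_based = re.findall(
    r'(?:^AZA:([0-9]*).*\n)?^ZZZ:([0-9]*)', sample, re.MULTILINE
)

print(newline_based)  # [('06', '50')]
print(anchor_based)   # [('', '10'), ('06', '50')]
```

For the input in the question both forms give the same result, since its first line is an AZA line.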
