Python regex: optional capture group

I am having trouble extracting the data I need from filenames like these:

miniseries.season 1.part 5.720p.avi
miniseries.part 5.720p.avi
miniseries.part VII.720p.avi     # episode or season expressed in Roman numerals

The "season XX" chunk may or may not be present, or it may be written in a short form such as "s 1" or "seas 1".

In every case I would like four capture groups that give the following output:

group1 : miniseries
group2 : 1 (or None)
group3 : 5
group4 : 720p.avi

So I wrote this regex:

(^.*)\Ws[eason ]*(\d{1,2}|[ivxlcdm]{1,5})\Wp[art ]*(\d{1,2}|[ivxlcdm]{1,5})\W(.*$)

This only works when I have a fully specified filename that includes the optional "season XX" string. Is it possible to write a regex that returns None for group 2 if "season" is not found?

It is easy enough to make the season group optional:

(^.*?)(?:\Ws(?:eason )?(\d{1,2}|[ivxlcdm]{1,5}))?\Wp(?:art )?(\d{1,2}|[ivxlcdm]{1,5})\W(.*$)

This wraps the season section in a non-capturing group, (?:...), followed by the zero-or-one quantifier, ?. I also had to make the first group non-greedy (.*?); otherwise it would swallow the season section of the name.

I also turned the optional "eason" and "art" strings into optional non-capturing groups instead of character classes.
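To see why the first group needs to be non-greedy, here is a small sketch that compiles the same pattern twice, once with (^.*) and once with (^.*?). The names NUM, TAIL, greedy and lazy are only illustrative helpers and are not part of the original pattern.

```python
import re

# Shared pieces of the pattern (helper names are illustrative only).
NUM = r"(\d{1,2}|[ivxlcdm]{1,5})"
TAIL = r"(?:\Ws(?:eason )?" + NUM + r")?\Wp(?:art )?" + NUM + r"\W(.*$)"

greedy = re.compile(r"(^.*)" + TAIL, re.I)   # first group grabs as much as it can
lazy = re.compile(r"(^.*?)" + TAIL, re.I)    # first group grabs as little as it can

name = "miniseries.season 1.part 5.720p.avi"
greedy_groups = greedy.search(name).groups()
lazy_groups = lazy.search(name).groups()

print(greedy_groups)  # ('miniseries.season 1', None, '5', '720p.avi')
print(lazy_groups)    # ('miniseries', '1', '5', '720p.avi')
```

With the greedy version the season text ends up inside group 1 and the optional season group is simply skipped, which is valid as far as the regex engine is concerned but not what is wanted here.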
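The difference between the character class and the optional group can be checked in isolation. The sketch below compares s[eason ]* with s(?:eason )? on a few made-up sample strings; the variable names are illustrative only.

```python
import re

char_class = re.compile(r"s[eason ]*")   # 's' then any mix of e, a, s, o, n, space
literal = re.compile(r"s(?:eason )?")    # 's', optionally followed by exactly 'eason '

samples = ["season ", "s", "sea", "sonnes "]
results = {
    s: (bool(char_class.fullmatch(s)), bool(literal.fullmatch(s)))
    for s in samples
}

for sample, (by_class, by_group) in results.items():
    print(repr(sample), by_class, by_group)
```

The character class accepts any jumble of those letters, such as "sonnes ", while the optional group accepts only "s" or "season ". One consequence is that the stricter form no longer accepts abbreviations like "seas ", so those would need to be listed explicitly if they matter.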

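Result (the re.I flag makes the match case-insensitive, so upper-case Roman numerals such as VII are accepted):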

>>> import re
>>> p=re.compile(r'(^.*?)(?:\Ws(?:eason )?(\d{1,2}|[ivxlcdm]{1,5}))?\Wp(?:art )?(\d{1,2}|[ivxlcdm]{1,5})\W(.*$)', re.I)
>>> p.search('miniseries.season 1.part 5.720p.avi').groups()
('miniseries', '1', '5', '720p.avi')
>>> p.search('miniseries.part 5.720p.avi').groups()
('miniseries', None, '5', '720p.avi')
>>> p.search('miniseries.part VII.720p.avi').groups()
('miniseries', None, 'VII', '720p.avi')
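The pattern above covers "season 1" but not the short forms "s 1" and "seas 1" mentioned in the question. One possible extension, offered as a sketch rather than a definitive solution, is to spell out the abbreviations and allow optional whitespace before the number. The function name parse_filename and the group names title, season, part and rest are assumptions introduced here for illustration.

```python
import re

# Verbose pattern; whitespace and comments inside it are ignored by re.VERBOSE.
FILENAME_RE = re.compile(
    r"""
    ^(?P<title>.*?)                              # title, non-greedy
    (?:                                          # optional season section
        \W s(?:eas(?:on)?)? \s*                  #   's', 'seas' or 'season'
        (?P<season>\d{1,2}|[ivxlcdm]{1,5})       #   arabic or roman number
    )?
    \W p(?:art)? \s*                             # 'p' or 'part'
    (?P<part>\d{1,2}|[ivxlcdm]{1,5})             # arabic or roman number
    \W (?P<rest>.*)$                             # everything after the part
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_filename(name):
    """Return a dict of the four fields, or None if the name does not match."""
    match = FILENAME_RE.search(name)
    return match.groupdict() if match else None


for name in (
    "miniseries.season 1.part 5.720p.avi",
    "miniseries.seas 1.part 5.720p.avi",
    "miniseries.s 1.part 5.720p.avi",
    "miniseries.part VII.720p.avi",
):
    print(parse_filename(name))
```

Named groups make the missing season explicit: the 'season' key is None when that section is absent. Because the match is case-insensitive and the Roman numeral class is just a set of letters, unusual titles could still produce false positives, so real filenames should be tested before relying on this.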
