
How to create non-greedy regular expression from right?

I have a file named 'ab9c_xy8z_12a3.pdf'. I want to capture the part after the last underscore and before '.pdf'. Writing a regular expression like this:

    import re

    s = 'ab9c_xy8z_12a3.pdf'
    m = re.search(r'_.*?\.pdf', s)
    m.group(0)

returns: '_xy8z_12a3.pdf'

In this example, I would like to capture only the '12a3' part. Thank you for your help.

The _.*?\.pdf regex matches the first underscore with _ , then matches zero or more characters other than a newline, as few as possible, up to the leftmost occurrence of .pdf , which here is at the end of the string. So . matched all the underscores on its way to .pdf , because a regex engine scans the string from left to right and returns the first match it finds, and because the . pattern matches underscores too.
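To see why the lazy quantifier does not help here, the following minimal sketch (using only the string from the question) compares the lazy and greedy versions of the pattern. Both matches start at the first underscore, because the engine tries start positions from left to right; laziness only affects where a match ends.

```python
import re

s = "ab9c_xy8z_12a3.pdf"

# Lazy and greedy variants of the pattern from the question.
lazy = re.search(r"_.*?\.pdf", s)
greedy = re.search(r"_.*\.pdf", s)

# Both matches begin at the leftmost underscore (index 4).
print(lazy.start(), lazy.group(0))      # => 4 _xy8z_12a3.pdf
print(greedy.start(), greedy.group(0))  # => 4 _xy8z_12a3.pdf
```

Since the string contains only one .pdf , the two variants return the same text.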

You may fix the pattern by using a negated character class [^_] instead of . , which "subtracts" underscores from what the . pattern would match:

([^_]+)\.pdf

and grab the Group 1 value.

Python demo:

import re
rx = r"([^_]+)\.pdf"
s = "ab9c_xy8z_12a3.pdf"
m = re.search(rx, s)
if m:
    print(m.group(1)) # => 12a3
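As a rough sketch, the same pattern can be wrapped in a small helper and tried on a few names. The function name last_segment , the extra sample filenames, and the trailing $ anchor are assumptions added for illustration and are not part of the pattern above; the anchor only requires .pdf to be at the very end of the name.

```python
import re

# Hypothetical helper; the "$" anchor is an added assumption.
def last_segment(filename):
    """Return the underscore-free run of characters right before a trailing '.pdf', or None."""
    m = re.search(r"([^_]+)\.pdf$", filename)
    return m.group(1) if m else None

print(last_segment("ab9c_xy8z_12a3.pdf"))  # => 12a3
print(last_segment("report.pdf"))          # => report (no underscore at all)
print(last_segment("notes_v2.txt"))        # => None (name does not end in .pdf)
```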

Use re.split instead:

>>> re.split('[_.]', 'ab9c_xy8z_12a3.pdf')[-2]
'12a3'
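For illustration, here is what the split produces before indexing. The character class [_.] splits on every underscore and every dot, so this approach relies on the last segment containing no dot other than the one before the extension.

```python
import re

# Split on underscores and dots, then take the second-to-last piece.
parts = re.split(r"[_.]", "ab9c_xy8z_12a3.pdf")
print(parts)      # => ['ab9c', 'xy8z', '12a3', 'pdf']
print(parts[-2])  # => 12a3
```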
