I have a file named 'ab9c_xy8z_12a3.pdf'. I want to capture the part after the last underscore and before '.pdf'. Writing a regular expression like:
import re
s = 'ab9c_xy8z_12a3.pdf'
m = re.search(r'_.*?\.pdf',s)
m.group(0)
returns: '_xy8z_12a3.pdf'
In this example, I would like to capture only the '12a3' part. Thank you for your help.
The _.*?\.pdf regex matches the first underscore with _, then matches any 0+ chars other than a newline, as few as possible, up to the leftmost occurrence of .pdf, which turns out to be at the end of the string. So, the .*? part matched all the underscores on its way to .pdf, simply because a regex engine scans the string from left to right and the . pattern matches underscores too.
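To see this behavior directly, here is a small sketch that compares the lazy and greedy versions of the pattern on the string from the question. The variable names lazy and greedy are only for illustration. Both searches start at the first underscore, so they return the same text; laziness affects where a match ends, not where it begins.

```python
import re

s = "ab9c_xy8z_12a3.pdf"

# Both patterns start matching at the first underscore (index 4),
# because the engine tries positions from left to right.
lazy = re.search(r"_.*?\.pdf", s)
greedy = re.search(r"_.*\.pdf", s)

print(lazy.group(0))    # _xy8z_12a3.pdf
print(greedy.group(0))  # _xy8z_12a3.pdf
print(lazy.start())     # 4
```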
You may fix the pattern by using a negated character class, [^_], instead of the . pattern, which "subtracts" underscores from what . can match:
([^_]+)\.pdf
Then grab the Group 1 value:
import re
rx = r"([^_]+)\.pdf"
s = "ab9c_xy8z_12a3.pdf"
m = re.search(rx, s)
if m:
    print(m.group(1))  # => 12a3
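If the file name may contain no underscore at all, or '.pdf' somewhere other than the end, a slightly stricter variant might be useful. The sketch below is an assumption layered on top of the answer, not part of it: it requires a literal underscore before the group and anchors the match with $. The helper name last_part is hypothetical.

```python
import re

def last_part(name):
    """Return the text between the last underscore and a trailing '.pdf', or None."""
    # Assumption: the name must contain an underscore and end with '.pdf'.
    m = re.search(r"_([^_]+)\.pdf$", name)
    return m.group(1) if m else None

print(last_part("ab9c_xy8z_12a3.pdf"))  # 12a3
print(last_part("report.pdf"))          # None
print(last_part("a_b.pdf_c.pdf"))       # c
```

Without the leading underscore and the $ anchor, the original ([^_]+)\.pdf pattern would return 'report' and 'b' for the last two inputs, which may or may not be what you want.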
Use re.split instead:
>>> re.split('[_.]', 'ab9c_xy8z_12a3.pdf')[-2]
'12a3'
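Note that splitting on both _ and . relies on the name containing exactly one dot. A regex-free sketch that does not depend on that, assuming only that the extension is whatever os.path.splitext strips, could look like this. The helper name last_field is hypothetical.

```python
import os
import re

def last_field(name):
    # Strip the extension first, then split once on the last underscore.
    stem, _ext = os.path.splitext(name)
    return stem.rsplit("_", 1)[-1]

print(last_field("ab9c_xy8z_12a3.pdf"))               # 12a3

# With an extra dot in the name, the two approaches differ:
print(re.split("[_.]", "ab9c_xy8z_12a3.v2.pdf")[-2])  # v2
print(last_field("ab9c_xy8z_12a3.v2.pdf"))            # 12a3.v2
```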