I have url strings such as:
"https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide_3/"
Now I need to capture the slide_3 part, or more specifically the start position of the digit 3, with the constraint that it must be a single digit (neither preceded nor followed by another digit) and must not be preceded by an "=". So pageid=2 shouldn't match, while slide_3 should.
I tried this with a Python regex:
import re

p = re.compile(r'/.*(?<!=)(?<!\d)\d(?!\d).*/')
s = "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide_3/"
for m in p.finditer(s):
    print(m.start(), m.group())
and the result is
6 //facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide_3/
I understand why I get this: the first and the last "/" satisfy the regex, but so does the substring "/slide_3/".
How do I make sure I get the smallest substring that matches the regex?
And why doesn't this work?
'/[^/](?<!=)(?<!\d)\d(?!\d).*/'
The non-greedy operator .*? does not seem to do the trick, since it does not guarantee the shortest possible match.
Strings that should match:
"https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide_3/"
"https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/sno3/"
"https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/3/"
The matches should be slide_3, sno3, and 3, respectively.
Strings that shouldn't match:
"https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide/"
"https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide_33/"
"https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/33/"
If I understand your question correctly, you can use this to check whether a string matches your expected pattern:
(?:^.*\/)([^\d]*\d)(?:\/?$)
Capture group 1 (\1) will then contain:
slide_3
sno3
3
https://regex101.com/r/h0rNdC/4
This could be useful in getting the index of the match: Python Regex - How to Get Positions and Values of Matches
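As a minimal sketch of how the pattern above could be used from Python to get both the segment and the position of the digit: the helper name last_segment_digit is hypothetical, and the code assumes the digit is always the last character of group 1, which follows from the way the pattern is written. Note that this pattern does not by itself exclude a final segment such as pageid=2.

```python
import re

# Pattern from the answer above; group 1 is the final path segment,
# which must end in exactly one digit.
PATTERN = re.compile(r"(?:^.*/)([^\d]*\d)(?:/?$)")

def last_segment_digit(url):
    """Return (segment, index_of_digit), or None if the URL does not match.

    Hypothetical helper, for illustration only.
    """
    m = PATTERN.search(url)
    if m is None:
        return None
    # The digit is the last character captured by group 1.
    return m.group(1), m.end(1) - 1

base = "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/"
for tail in ("slide_3/", "sno3/", "3/", "slide/", "slide_33/", "33/"):
    print(tail, last_segment_digit(base + tail))
```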
You could match the forward slash, then match 0+ times any character except a digit, /, =, or a newline. Capture a single digit in a capturing group and match the trailing forward slash.
To get the start and end indices of the match, you could for example use re.search, which returns a match object.
/[^\d/=\r\n]*(\d)/
For example:
import re

regex = r"/[^\d/=\r\n]*(\d)/"
strings = [
    "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide_3/",
    "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/sno3/",
    "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/3/",
    "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide/",
    "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/slide_33/",
    "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/33/"
]
for s in strings:
    matches = re.search(regex, s)
    if matches:
        # Group 1 is the single digit; start(1)/end(1) give its position in s.
        print("Group {groupNum} found at {start}-{end} value:{group}".format(
            groupNum=1, start=matches.start(1), end=matches.end(1), group=matches.group(1)))
Result
Group 1 found at 74-75 value:3
Group 1 found at 71-72 value:3
Group 1 found at 68-69 value:3
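If the whole segment (slide_3, sno3, 3) is wanted as well as the digit position, one possible variation keeps the lookarounds from the question and confines the match to a single path segment with [^/]*. This is only a sketch under that assumption, and the helper name find_single_digit_segment is made up for illustration. It also hints at why the attempt with '/[^/]...' fails: [^/] without a quantifier matches exactly one character, so the digit would have to be the second character after the slash.

```python
import re

# Group 1: the whole path segment; group 2: a digit that is not preceded
# by "=" or another digit and not followed by another digit.
# [^/]* (with the *) keeps the match inside one segment.
SEGMENT = re.compile(r"/([^/]*(?<![=\d])(\d)(?!\d)[^/]*)/")

def find_single_digit_segment(url):
    """Return (segment, start_of_digit), or None if nothing qualifies.

    Hypothetical helper, for illustration only.
    """
    m = SEGMENT.search(url)
    if m is None:
        return None
    return m.group(1), m.start(2)

base = "https://facty.com/ailments/body/10-home-remedies-for-styes/pageid=2/"
for tail in ("slide_3/", "sno3/", "3/", "slide/", "slide_33/", "33/"):
    print(tail, find_single_digit_segment(base + tail))
```

Unlike the character-class version above, this variant would also accept a segment containing several separate single digits (for example a1b2), so which one fits depends on how strict the match needs to be.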