I'm trying to match the following URL by its query string from a html page in Python but could not able to solved it. I'm a newbie in python.
<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>
I want to match the above URL with &user_id=[any_digit_from_0_to_99]&
and print this URL on the screen.
URL without this &user_id=[any_digit_from_0_to_99]&
wont be match.
Here's my horror incomplete regex code: https?:\\/\\/.{0,30}\\.+[a-zA-Z0-9\\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"
I know this code has so many wrong, but this code somehow managed to match the above URL till "
double qoute.
Complete code would look like this:
import re
reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Output:
$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"
It shows the "
at the end of the URL and I know this is not the good regex code I want the better version of my above code.
A few remarks can be made on your regexp:
/
is not a special re
character, there's no need to escape it .*
[0-9][0-9]
part will also match stuff like 04
, which is not strictly speaking a digit between 0 and 99 Taking this into account, you can design this simpler regex:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)
Using this regex on your example will print 'http://example.com/?query_id=9&user_id=4&'
, without the "
at the end. If you want to have to full URL, then you can look for the />
symbol:
reg = re.compile("https?://.*&user_id=[1-9][0-9]?&.*/>")
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()[:-2]
print(result)
Note the [:-2]
which is used to remove the />
symbol. In that case, this code will print http://example.com/?query_id=9&user_id=4&token_id=4JGO4I394HD83E" id="838"
.
Note also that these regexp usesthe wildcard .
. Depending on whether you are sure that the strings you're working with contains only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters. You may want to look at the \\w
special sequence with the ASCII flag of the re module .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.