
How to find a matching URL in an HTML page using regex in Python

I'm trying to match the following URL by its query string in an HTML page using Python, but I have not been able to solve it. I'm new to Python.

<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>

I want to match the above URL on &user_id=[any_digit_from_0_to_99]& and print the URL to the screen.

A URL without &user_id=[any_digit_from_0_to_99]& should not match.

Here's my horrible, incomplete regex: https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"

I know this code has many problems, but it somehow manages to match the above URL up to the closing double quote (").

The complete code looks like this:

import re

reg = re.compile(r'https?:\/\/.{0,30}\.+[a-zA-Z0-9\/?_+=]{0,30}&user_id=[0-9][0-9]&.*?"')
str = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(str)
result = result.group()
print(result)

Output:

$ python reg.py
http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E"

It shows the " at the end of the URL. I know this is not a good regex, and I would like a better version of the code above.

A few remarks on your regexp:

  1. / is not a special character in re, so there is no need to escape it.
  2. Is the 30-character limit on the domain intentional? If not, you can match as many characters as you like with .*
  3. Do you know that the string you're working with contains a valid URL? If not, there are some checks you can add, such as ensuring the domain is at least 4 characters long, contains a period that is not the last character, etc.
  4. The [0-9][0-9] part also matches values like 04 , which is not strictly speaking a number between 0 and 99, and it does not match single-digit values such as 4.
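To make remark 4 concrete, here is a small sketch comparing the two-digit pattern from the question with one possible pattern for 0 to 99 without leading zeros. The names two_digits, zero_to_99 and the sample strings are made up for illustration, and the alternative pattern is only one way to express the range.

```python
import re

# Pattern from the question: exactly two digits.
two_digits = re.compile(r'&user_id=[0-9][0-9]&')
# Illustrative alternative: either 0, or 1-99 without a leading zero.
zero_to_99 = re.compile(r'&user_id=(?:0|[1-9][0-9]?)&')

samples = ['&user_id=4&', '&user_id=04&', '&user_id=49&', '&user_id=100&']
for s in samples:
    # Print whether each pattern finds a match in the sample.
    print(s, bool(two_digits.search(s)), bool(zero_to_99.search(s)))
```

With these samples, the two-digit pattern rejects 4 and accepts 04, while the alternative does the opposite. Both accept 49 and reject 100.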

Taking this into account, you can design this simpler regex:

import re

# Matches from the scheme up to and including "&user_id=<1-99>&"
reg = re.compile(r"https?://.*&user_id=[1-9][0-9]?&")
html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(html)
result = result.group()
print(result)

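Using this regex on your example prints http://example.com/?query_id=9&user_id=49& , without the " at the end, but the match stops right after user_id=49& . Note that [1-9][0-9]? covers 1 to 99 and does not match a user_id of 0. If you want the full URL, you can extend the match up to the /> symbol: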

# Same pattern, extended up to the closing "/>" of the tag
reg = re.compile(r"https?://.*&user_id=[1-9][0-9]?&.*/>")
html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
result = reg.search(html)
result = result.group()[:-2]  # drop the trailing "/>"
print(result)

Note the [:-2] , which removes the /> symbol. This code prints http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838" , so the result still contains the closing quote and the id attribute that follow the URL.
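If only the URL itself is wanted, one option is to capture the value of the href attribute. The sketch below assumes the URL always sits inside a double-quoted href, as in the example. The name href_re and the 0-to-99 sub-pattern are illustrative choices, not part of the original code.

```python
import re

# Assumption: the URL is the value of a double-quoted href attribute.
# [^"]* cannot cross the closing quote, so the capture stays inside the attribute.
href_re = re.compile(r'href="(https?://[^"]*&user_id=(?:0|[1-9][0-9]?)&[^"]*)"')

html = '<a href="http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E" id="838"/>'
match = href_re.search(html)
if match:  # search() returns None when nothing matches
    print(match.group(1))
```

Checking the result of search() before calling group() avoids an AttributeError on pages where no such link is present. For anything beyond simple, well-formed markup, an HTML parser is usually more robust than a regex.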

Note also that these regexps use the wildcard . (dot). Depending on whether you are sure that the strings you're working with contain only valid URLs, you may want to change this. For instance, a domain name can only contain ASCII characters. You may want to look at the \w special sequence together with the ASCII flag of the re module.
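As a rough illustration of that last suggestion, the sketch below restricts the host part to ASCII word characters and hyphens using re.ASCII. The name host_re and the pattern are assumptions made for this example. The pattern is deliberately loose (for instance, \w also allows underscores) and is not a full hostname validator.

```python
import re

# Illustrative host pattern: dot-separated labels made of ASCII word
# characters or hyphens. re.ASCII limits \w to [a-zA-Z0-9_].
host_re = re.compile(r'https?://[\w-]+(?:\.[\w-]+)+', re.ASCII)

ascii_url = 'http://example.com/?query_id=9&user_id=49&token_id=4JGO4I394HD83E'
non_ascii_url = 'http://ex\u00e4mple.com/'  # contains a non-ASCII letter in the host

print(bool(host_re.match(ascii_url)))
print(bool(host_re.match(non_ascii_url)))
```

Without the re.ASCII flag, \w in a str pattern also matches non-ASCII letters, so the second URL would be accepted.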
