How to validate a URL in Python? (Malformed or not)

Question

I have a URL from the user, and I have to reply with the fetched HTML.

How can I check whether the URL is malformed?

For example:

url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed

Answer 1

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).

Answer 2

In fact, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator()  # very old Django releases also accepted verify_exists=False here
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)

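If you set verify_exists to True, it will actually verify that the URL exists; otherwise it just checks that it is formed correctly. Note that verify_exists only exists in old Django releases (it was removed in Django 1.5), so on current versions URLValidator checks the format only.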
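Since the existence check is not available in newer Django versions, one way to approximate it is a separate HEAD request made after the format check. The following is only a minimal sketch using the standard library; url_exists, its opener parameter and the 5-second timeout are hypothetical names and values chosen for illustration, not part of Django:

```python
from urllib.error import URLError
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def url_exists(url, opener=urlopen, timeout=5):
    """Return True if a HEAD request to url succeeds (hypothetical helper)."""
    parts = urlparse(url)
    # Reject malformed input before touching the network.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    try:
        with opener(Request(url, method="HEAD"), timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (URLError, ValueError, OSError):
        # DNS failures, refused connections and HTTP errors all end up here.
        return False
```

The opener parameter exists only so the function can be exercised without network access; in normal use it is left at its default.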

Edit: ah yes, this question is a duplicate of this one: How can I check if a URL exists with Django's validators?

Answer 3

Django URL validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False

Answer 4

A True-or-False version, based on @DMfll's answer:

try:
    # python2
    from urlparse import urlparse
except ImportError:
    # python3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except:
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))

This gives:

True
False
False
False
True

Answer 5

Nowadays I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

This is how it looks:

from urllib.parse import urlparse

def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

Just use is_url("http://www.asdf.com").

Hope it helps!
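One caveat: a scheme-plus-netloc check accepts http://google, which the question lists as malformed. A minimal sketch of a stricter variant, assuming that "valid" also means the host contains at least one dot (is_url_with_tld is a hypothetical name, and the dot rule is an assumption that would also reject hosts such as localhost):

```python
from urllib.parse import urlparse


def is_url_with_tld(url):
    """Like is_url, but also require a dotted host name (assumed rule)."""
    try:
        result = urlparse(url)
    except ValueError:
        return False
    host = result.hostname or ""
    # strip(".") ignores a leading or trailing dot such as "google."
    return bool(result.scheme) and "." in host.strip(".")


for candidate in ("google", "google.com", "http://google.com", "http://google"):
    print(candidate, is_url_with_tld(candidate))
```

Only http://google.com passes, which matches the four examples in the question.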

Answer 6

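I landed on this page trying to figure out a sane way to validate strings as "valid" URLs. I am sharing my solution using python3 here. No extra libraries are required.

If you are using python2, see https://docs.python.org/2/library/urlparse.html.

If you are using python3 like I am, see https://docs.python.org/3.0/library/urllib.parse.html.

import urllib.parse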
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')

ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')

'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.

'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])
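As a short usage sketch of the function above: the qualifying parameter lets a caller tighten the check by naming more ParseResult attributes. The strict tuple below is a hypothetical example, not something from the answer:

```python
from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all(getattr(tokens, attr) for attr in qualifying)


# Hypothetical stricter rule: also require a non-empty path.
strict = ('scheme', 'netloc', 'path')

print(is_valid('https://stackoverflow.com'))                     # True
print(is_valid('https://stackoverflow.com', strict))             # False, path is empty
print(is_valid('https://stackoverflow.com/questions', strict))   # True
```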

Answer 7

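Note - lepl is no longer supported, sorry (you're welcome to use it, and I think the code below works, but it won't get updates).

RFC 3696 (http://www.faqs.org/rfcs/rfc3696.html) defines how to do this (for HTTP URLs and email). I implemented its recommendations in Python using lepl (a parser library). See http://acooke.org/lepl/rfc3696.html

To use: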

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True

Answer 8

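EDIT

As pointed out by @Kwame, the code below validates the URL even if the .com or .co etc. is not present.

@Blaise also pointed out that a URL like https://www.google is a valid URL, and you need to do a DNS check separately to see whether it resolves.

This is simple and works:

So min_attr contains the basic set of strings that need to be present to define the validity of a URL, i.e. the http:// part and the google.com part.

urlparse.scheme stores the scheme (http, without the ://) and

urlparse.netloc stores the domain name google.com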

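from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse

def url_check(url):
    # Attributes that must be non-empty for the URL to count as valid
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        return all(getattr(result, attr) for attr in min_attr)
    except Exception:
        # urlparse raises on non-string input and on some malformed netlocs
        return False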

all() returns True if all the values passed to it are truthy. So if result.scheme and result.netloc are present, i.e. have some value, the URL is valid and the function returns True.
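The separate DNS check mentioned in the edit note can be sketched roughly as follows. This is an illustration only: resolves and its resolver parameter are hypothetical names, and socket.getaddrinfo is just one way to ask the system resolver whether a host name exists:

```python
import socket
from urllib.parse import urlparse


def resolves(url, resolver=socket.getaddrinfo):
    """Return True if the host part of url resolves via DNS (hypothetical helper)."""
    host = urlparse(url).hostname
    if not host:
        return False  # nothing to look up, e.g. "google" has no netloc
    try:
        resolver(host, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the name does not resolve; UnicodeError: invalid host label
        return False
    return True
```

A URL such as https://www.google passes url_check but would normally fail this lookup, which is why the two checks are kept separate.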

Answer 9

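Validate URLs using `urllib` and a Django-like regex

Django's URL validation regex is actually quite good, but I needed to tweak it a little for my use case. Feel free to adapt it to yours!

Python 3.7

import re
import urllib.parse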

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url

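Explanation

The code only validates the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into those two parts and then match each against the corresponding regex.)

The netloc part stops before the first occurrence of a slash /, so the port number is still part of the netloc, e.g.:

https://www.google.com:80/search?q=python
^^^^^   ^^^^^^^^^^^^^^^^^
  |             |
  |             +-- netloc (aka "domain" in my code)
  +-- scheme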
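The same split can be inspected directly with urlparse. A small sketch using the example URL from this answer, showing that the port stays inside netloc while hostname and port expose the two pieces separately:

```python
from urllib.parse import urlparse

parts = urlparse("https://www.google.com:80/search?q=python")

print(parts.scheme)    # https
print(parts.netloc)    # www.google.com:80
print(parts.hostname)  # www.google.com
print(parts.port)      # 80
print(parts.path)      # /search
print(parts.query)     # q=python
```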

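IPv4 addresses are also validated

IPv6 support

If you want the URL validator to also work with IPv6 addresses, do the following:

Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex
Add and not is_valid_ipv6(domain) to the last if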
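As a rough illustration of these steps, is_valid_ipv6 could also be sketched with the standard library's ipaddress module instead of a regex. Note that this swaps in a different technique from the regex in Markus Jarderot's answer, and host_is_ipv6 is a hypothetical helper. It reads the host through urlparse(...).hostname because the netloc of an IPv6 URL still carries the square brackets and the port:

```python
import ipaddress
from urllib.parse import urlparse


def is_valid_ipv6(ip):
    """Return True if ip is a syntactically valid IPv6 address."""
    try:
        ipaddress.IPv6Address(ip)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


def host_is_ipv6(url):
    """Hypothetical helper: check the bracket-free host part of a URL."""
    host = urlparse(url).hostname  # "[2001:db8::1]:8080" becomes "2001:db8::1"
    return host is not None and is_valid_ipv6(host)
```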

Examples

Here are some examples of the regex for the netloc (aka domain) part:

IPv4 and alphanumeric: https://regex101.com/r/A326u1/5
IPv6: https://regex101.com/r/lKIIgq/1 (using the regex from Markus Jarderot's answer)

Answer 10

All of the above solutions recognize a string like "http://www.google.com/path,www.yahoo.com/path" as valid. This solution always works as it should.

import re

# URL-link validation
ip_middle_octet = r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

# Raw strings keep the regex escapes (\., \d, \S) intact; the re module itself
# interprets the \uXXXX escapes inside the character classes.
URL_PATTERN = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    r"(?P<private_ip>"
    # IP address exclusion
    # private & local networks
    r"(?:localhost)|"
    r"(?:(?:10|127)" + ip_middle_octet + r"{2}" + ip_last_octet + r")|"
    r"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + r")|"
    r"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + r"))"
    r"|"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?P<public_ip>"
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"" + ip_middle_octet + r"{2}"
    r"" + ip_last_octet + r")"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    # query string
    r"(?:\?\S*)?"
    r"$",
    re.UNICODE | re.IGNORECASE
)

def url_validate(url):
    """Return a match object if url is a valid URL, else None."""
    return URL_PATTERN.match(url)

Answer 11

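Not directly relevant, but it is often necessary to decide whether some token CAN be a URL, even if it is not 100% correctly formed (e.g. the https part is omitted). I read this post and did not find a solution, so I am posting my own here for the sake of completeness.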

def get_domain_suffixes():
    import requests
    res=requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst=set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains=line.split('.')
            cand=domains[-1]
            if cand:
                lst.add('.'+cand)
    return tuple(sorted(lst))

domain_suffixes=get_domain_suffixes()

def reminds_url(txt:str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    
    """
    ltext=txt.lower().split('/')[0]
    return ltext.startswith(('http','www','ftp')) or ltext.endswith(domain_suffixes)
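To try the same idea without a network request, the suffix extraction can be run on text that is already in memory. The sketch below is an offline variant under stated assumptions: SAMPLE_LIST is a made-up excerpt in the style of the public suffix list, and suffixes_from_text plus the extra suffixes parameter are hypothetical changes to the code above:

```python
# Made-up excerpt in the format of public_suffix_list.dat (assumption).
SAMPLE_LIST = """\
// ===BEGIN ICANN DOMAINS===
com
ru
co.uk
*.ck

"""


def suffixes_from_text(text):
    """Collect the last label of every non-comment line, e.g. 'co.uk' gives '.uk'."""
    found = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        found.add("." + line.split(".")[-1])
    return tuple(sorted(found))


def reminds_url(txt, suffixes):
    """Same heuristic as above, with the suffix tuple passed in explicitly."""
    head = txt.lower().split("/")[0]
    return head.startswith(("http", "www", "ftp")) or head.endswith(suffixes)


domain_suffixes = suffixes_from_text(SAMPLE_LIST)
print(reminds_url("yandex.ru.com/somepath", domain_suffixes))  # True
```

Passing the suffixes in explicitly also avoids downloading the list at import time.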

Answer 12

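Here's a regex solution, since the top-voted regex doesn't work for weird cases like multi-part top-level domains. Some test cases are below.

import re

regex = re.compile(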
    r"(\w+://)?"                # protocol                      (optional)
    r"(\w+\.)?"                 # host                          (optional)
    r"((\w+)\.(\w+))"           # domain
    r"(\.\w+)*"                 # top-level domain              (optional, can have > 1)
    r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc.   (optional)
)

cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com",
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co.",
    "https://google.co.",
]
for c in cases:
    m = regex.match(c)
    # match() returns None for some cases, so guard before comparing lengths
    print(c, m is not None and m.end() - m.start() == len(c))

Answer 13

A function based on Dominic Tarro's answer:

import re
def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://" # protocol
        r"(\w+\.)?" # host (optional)
        r"((\w+)\.(\w+))" # domain
        r"(\.\w+)*" # top-level domain (optional, can have > 1)
        r"([\w\-\._\~/]*)*(?<!\.)" # path, params, anchors, etc. (optional)
    , x))

Answer 14

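Pydantic can be used to do this. I am not very used to it, so I can't speak to its limitations. It is an option, and nobody had suggested it.

I have seen many people question ftp and file URLs in previous answers, so I recommend getting to know the documentation: Pydantic has many types for validation, such as FileUrl, AnyUrl and even database URL types.

A simple usage example: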

from requests import get
from pydantic import BaseModel, AnyHttpUrl, ValidationError

class MyConfModel(BaseModel):
    URI: AnyHttpUrl

try:
    myAddress = MyConfModel(URI="http://myurl.com/")
    # str() keeps this working when URI is a URL object rather than a plain string
    req = get(str(myAddress.URI), verify=False)
    print(myAddress.URI)
except ValidationError:
    print('Invalid destination')

Pydantic also raises an exception (pydantic.ValidationError) that can be used to handle errors.

I have tested it with these patterns:

http://localhost (pass)
http://localhost:8080 (pass)
http://example.com (pass)
http://user:password@example.com (pass)
http://_example.com (pass)
http://&example.com (fail)
http://-example.com (fail)

Answer 15

from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

url = 'http://google.com'
if is_valid_url(url):
    print('Valid URL')
else:
    print('Malformed URL')
