How to validate a URL in Python? (Malformed or not)

I have a URL from the user and I have to reply with the fetched HTML.

How can I check whether the URL is malformed or not?

For example:

url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).

Actually, I think this is the best way.

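from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

# URLValidator only checks the format. The old verify_exists argument
# was deprecated in Django 1.4 and later removed, so it is not passed here.
val = URLValidator()
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)

On old Django versions you could pass verify_exists=True to make the validator actually check that the URL exists; without it, the validator only checks that the URL is formed correctly. That argument has since been removed from Django, so current versions only do the format check.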

edit: ah yeah, this question is a duplicate of this: How can I check if a URL exists with Django's validators?

Django URL validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False

A True or False version, based on @DMfll's answer:

try:
    # python2
    from urlparse import urlparse
except ImportError:
    # python3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except (ValueError, AttributeError):  # malformed URL or non-string input
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))

Gives:

True
False
False
False
True

Nowadays, I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

And this is how it looks:

from urllib.parse import urlparse

def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

Just use is_url("http://www.asdf.com").

Hope it helps!
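A caveat worth seeing in action: a scheme-plus-netloc check is looser than the question's own examples, since it accepts http://google. The sketch below repeats is_url from above so it runs on its own; the sample list is my own choice (the first four come from the question, the last one is an unclosed IPv6 bracket, which is a case where urlparse raises ValueError).

```python
from urllib.parse import urlparse


def is_url(url):
    # Same check as in the answer above: both scheme and netloc must be non-empty.
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False


# Illustrative inputs only; the last one has an unclosed "[" in the netloc.
samples = ["google", "google.com", "http://google.com", "http://google", "http://[::1"]
results = {s: is_url(s) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

If http://google must be rejected, an additional check on the host part (for example, requiring a dot in the netloc) is needed on top of this.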

I landed on this page trying to figure out a sane way to validate strings as "valid" URLs. I share here my solution using Python 3. No extra libraries required.

See https://docs.python.org/2/library/urlparse.html if you are using Python 2.

See https://docs.python.org/3.0/library/urllib.parse.html if you are using Python 3, as I am.

import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')

ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')

'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.

'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])

Note: lepl is no longer supported, sorry (you're welcome to use it, and I think the code below works, but it's not going to get updates).

RFC 3696 (http://www.faqs.org/rfcs/rfc3696.html) defines how to do this (for HTTP URLs and email). I implemented its recommendations in Python using lepl (a parser library). See http://acooke.org/lepl/rfc3696.html

To use:

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True
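Since lepl is unmaintained, here is a rough standard-library sketch in the spirit of the RFC 3696 hostname rules (labels of letters, digits and hyphens, no leading or trailing hyphen, at most 63 characters per label, at most 255 for the whole name, last label not all-numeric). It is not the lepl implementation and covers only part of the RFC. The function name and the requirement of at least two labels are my assumptions, the latter chosen so the results agree with the session above.

```python
import re
from urllib.parse import urlparse

# One DNS label: letters, digits, hyphens; no hyphen at either end; 1..63 chars.
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def looks_like_http_url(url):
    """Hypothetical helper: partial RFC 3696-style check for http(s) URLs."""
    try:
        parts = urlparse(url)
        host = parts.hostname
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    if len(host) > 255:
        return False
    labels = host.rstrip(".").split(".")
    # Assumption: require at least "name.tld", mirroring the lepl output above.
    if len(labels) < 2:
        return False
    if not all(LABEL.fullmatch(label) for label in labels):
        return False
    # A top-level label should not be all-numeric.
    return not labels[-1].isdigit()


for candidate in ("google", "http://google", "http://google.com"):
    print(candidate, looks_like_http_url(candidate))
```

Because the last label may not be all-numeric, this sketch also rejects bare IPv4 hosts; relax that line if you need them.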

EDIT

As pointed out by @Kwame, the code below validates the URL even if the .com or .co etc. is not present.

As @Blaise also pointed out, a URL like https://www.google is a valid URL, and you need to do a separate DNS check to see whether it resolves.

This is simple and works:

min_attr contains the basic set of attributes that need to be present for a URL to be considered valid, i.e. the http:// part and the google.com part.

urlparse.scheme stores the scheme (http) and

urlparse.netloc stores the domain name (google.com)

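from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        # valid only if every attribute named in min_attr is non-empty
        return all(getattr(result, attr) for attr in min_attr)
    except (ValueError, AttributeError):
        # malformed URL or non-string input
        return False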

all() returns True if every element passed to it is truthy. So if result.scheme and result.netloc are both present, i.e. have some value, the URL is considered valid and the function returns True.
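For the separate DNS check mentioned in the EDIT, something along these lines could work. This is only a sketch: the function name and the injectable resolver parameter are my own additions (the latter so the logic can be exercised without network access), and a successful lookup says nothing about whether a web server answers at that host.

```python
import socket
from urllib.parse import urlparse


def hostname_resolves(url, resolver=socket.getaddrinfo):
    """Hypothetical helper: True if the URL's host can be resolved.

    `resolver` defaults to socket.getaddrinfo and can be replaced in tests.
    """
    try:
        host = urlparse(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    try:
        resolver(host, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: host cannot be IDNA-encoded
        return False
    return True
```

Usage would be hostname_resolves("https://www.google"), run after the format check has passed, since the lookup can be slow.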

Validate URL with urllib and Django-like regex

The Django URL validation regex was actually pretty good, but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!

Python 3.7

import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url

Explanation

  • The code only validates the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into these two parts, which are then matched against the corresponding regex.)
  • The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.:

     https://www.google.com:80/search?q=python
     ^^^^^   ^^^^^^^^^^^^^^^^^
       |             |
       |             +-- netloc (aka "domain" in my code)
       +-- scheme
  • IPv4 addresses are also validated

IPv6 Support

If you want the URL validator to also work with IPv6 addresses, do the following:

  • Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex
  • Add and not is_valid_ipv6(domain) to the last if
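If you would rather not pull in a regex for this, a possible stand-in for is_valid_ipv6 can be built on the standard-library ipaddress module. Note that this is not Markus Jarderot's validator, and host_is_ipv6 is a helper name I made up. It reads the host through urlparse(...).hostname, which strips the square brackets and the port that an IPv6 netloc carries.

```python
import ipaddress
from urllib.parse import urlparse


def is_valid_ipv6(ip):
    # Stand-in based on ipaddress; IPv6Address raises a ValueError subclass
    # for anything that is not a valid IPv6 address.
    try:
        ipaddress.IPv6Address(ip)
        return True
    except ValueError:
        return False


def host_is_ipv6(url):
    """Hypothetical helper: True if the URL's host is an IPv6 literal."""
    try:
        host = urlparse(url).hostname  # brackets and port are removed here
    except ValueError:
        return False
    return host is not None and is_valid_ipv6(host)


print(host_is_ipv6("http://[::1]:8080/index.html"))
print(host_is_ipv6("http://www.google.com"))
```

In validate_url the domain check could then be skipped when host_is_ipv6(url) is true.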

Examples

Here are some examples of the regex for the netloc (aka domain) part in action:
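A few illustrative inputs run against the netloc regex, as a sketch. The sample values are my own picks, and DOMAIN_FORMAT is repeated from the code above only so the snippet runs on its own; netloc_ok is a hypothetical wrapper name.

```python
import re

# Copied unchanged from the answer above.
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)"
    r"(?:(?:(?=\S{0,253}(?:$|:))"
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+"
    r"(?:[a-z0-9]{1,63})))"
    r"|localhost)"
    r"(:\d{1,5})?",
    re.IGNORECASE
)


def netloc_ok(netloc):
    # Hypothetical wrapper: the whole netloc has to match, as in validate_url.
    return re.fullmatch(DOMAIN_FORMAT, netloc) is not None


samples = [
    "www.google.com",
    "www.google.com:80",
    "localhost:8080",
    "user:pass@example.com",
    "192.168.0.1",
    "google",            # no dot, so no top-level part
    "-bad.example.com",  # label starts with a dash
    "exa mple.com",      # whitespace inside the host
]
for netloc in samples:
    print(f"{netloc!r}: {netloc_ok(netloc)}")
```

The first five samples are accepted and the last three are rejected.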

All of the above solutions recognize a string like "http://www.google.com/path,www.yahoo.com/path" as valid. The regex below is much stricter about the scheme, credentials, host and IP parts. Be aware that its resource-path pattern (/\S*) still accepts any non-whitespace characters, commas included, so test it against the inputs you care about:

import re

# URL-link validation
ip_middle_octet = u"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = u"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
                        u"^"
                        # protocol identifier
                        u"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
                        # user:pass authentication
                        u"(?:\S+(?::\S*)?@)?"
                        u"(?:"
                        u"(?P<private_ip>"
                        # IP address exclusion
                        # private & local networks
                        u"(?:localhost)|"
                        u"(?:(?:10|127)" + ip_middle_octet + u"{2}" + ip_last_octet + u")|"
                        u"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + u")|"
                        u"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + u"))"
                        u"|"
                        # IP address dotted notation octets
                        # excludes loopback network 0.0.0.0
                        # excludes reserved space >= 224.0.0.0
                        # excludes network & broadcast addresses
                        # (first & last IP address of each class)
                        u"(?P<public_ip>"
                        u"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
                        u"" + ip_middle_octet + u"{2}"
                        u"" + ip_last_octet + u")"
                        u"|"
                        # host name
                        u"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
                        # domain name
                        u"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
                        # TLD identifier
                        u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
                        u")"
                        # port number
                        u"(?::\d{2,5})?"
                        # resource path
                        u"(?:/\S*)?"
                        # query string
                        u"(?:\?\S*)?"
                        u"$",
                        re.UNICODE | re.IGNORECASE
                       )
def url_validate(url):
    """Return a match object if `url` matches URL_PATTERN, else None."""
    # URL_PATTERN is already compiled, so it can be used directly.
    return URL_PATTERN.match(url)

Not directly relevant, but often it's required to identify whether some token CAN be a URL or not, without necessarily being 100% correctly formed (i.e., https part omitted and so on). I've read this post and did not find a solution, so I am posting my own here for the sake of completeness.

def get_domain_suffixes():
    import requests
    res=requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst=set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains=line.split('.')
            cand=domains[-1]
            if cand:
                lst.add('.'+cand)
    return tuple(sorted(lst))

domain_suffixes=get_domain_suffixes()

def reminds_url(txt:str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    
    """
    ltext=txt.lower().split('/')[0]
    return ltext.startswith(('http','www','ftp')) or ltext.endswith(domain_suffixes)

Here's a regex solution, since the top-voted regex doesn't handle awkward cases such as multi-part top-level domains (e.g. .co.uk). Some test cases are down below.

import re

regex = re.compile(
    r"(\w+://)?"                # protocol                      (optional)
    r"(\w+\.)?"                 # host                          (optional)
    r"((\w+)\.(\w+))"           # domain
    r"(\.\w+)*"                 # top-level domain              (optional, can have > 1)
    r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc.   (optional)
)
cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com",
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co.",
    "https://google.co."
]
for c in cases:
    # match() returns None when nothing matches at the start, so guard before comparing lengths
    m = regex.match(c)
    print(c, m is not None and m.end() == len(c))

Function based on Dominic Tarro's answer:

import re
def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://" # protocol
        r"(\w+\.)?" # host (optional)
        r"((\w+)\.(\w+))" # domain
        r"(\.\w+)*" # top-level domain (optional, can have > 1)
        r"([\w\-\._\~/]*)*(?<!\.)" # path, params, anchors, etc. (optional)
    , x))

Pydantic could be used to do that. I'm not very used to it, so I can't speak to its limitations. It is an option, though, and no one had suggested it.

I have seen that many people asked about ftp and file URLs in previous answers, so I recommend getting to know the documentation, as Pydantic has many types for validation, such as FileUrl, AnyUrl and even database URL types.

A simplistic usage example:

from requests import get
from pydantic import BaseModel, AnyHttpUrl, ValidationError

class MyConfModel(BaseModel):
    URI: AnyHttpUrl

try:
    # Validation happens when the model is created
    myAddress = MyConfModel(URI="http://myurl.com/")
    # str() keeps this working on pydantic v2, where URL fields are not plain strings
    req = get(str(myAddress.URI))
    print(myAddress.URI)

except ValidationError:
    print('Invalid destination')

Pydantic also raises exceptions (pydantic.ValidationError) that can be used to handle errors.

I have tested it with these patterns:

from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

url = 'http://google.com'
if is_valid_url(url):
    print('Valid URL')
else:
    print('Malformed URL')
