How to validate a URL in Python? (Malformed or not)
I have a URL from the user, and I have to reply with the fetched HTML.
How can I check whether the URL is malformed or not?
For example:
url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed
Use the validators package:
>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
...
not valid
>>>
Install it from PyPI with pip (pip install validators).
Actually, I think this is the best way:
from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator(verify_exists=False)
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)
If you set verify_exists to True, it will actually verify that the URL exists; otherwise it will just check that it's formed correctly. (Note that verify_exists was deprecated and later removed from Django, where URLValidator now only checks the format.)
edit: ah yeah, this question is a duplicate of this: How can I check if a URL exists with Django's validators?
import re

regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)
print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None) # False
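To get a plain boolean out of this pattern, a small wrapper is enough. This is a sketch; the `is_match` name is mine, and the pattern is repeated so the snippet stays self-contained:

```python
import re

# Same Django-style URL regex as above
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def is_match(url):
    """Return True when the Django-style regex accepts the URL."""
    return re.match(regex, url) is not None

print(is_match("http://localhost:8000"))       # True -- localhost + port branch
print(is_match("ftp://192.168.0.1/file.txt"))  # True -- IP branch
print(is_match("example.com"))                 # False -- scheme is required
```

Note that the pattern requires a scheme, so it rejects `example.com` even though a browser would happily load it.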
A True-or-False version, based on @DMfll's answer:
try:
    # Python 2
    from urlparse import urlparse
except ImportError:
    # Python 3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except (ValueError, AttributeError, TypeError):
        # non-string input (e.g. the int above) raises AttributeError/TypeError
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))
Gives:
True
False
False
False
True
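Note that this scheme-plus-netloc check is deliberately permissive: any string with a non-empty scheme and netloc passes, even when the host is clearly not a resolvable name. A quick sketch of the gap (the example inputs are mine):

```python
from urllib.parse import urlparse

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except (ValueError, AttributeError, TypeError):
        return False

# Any non-empty netloc is enough, so nonsense hosts still pass:
print(uri_validator("http://!!!"))      # True -- '!!!' is accepted as a netloc
print(uri_validator("https://google"))  # True -- no TLD check at all
print(uri_validator("http://"))         # False -- netloc is empty
```

If you need to reject hosts without a TLD, combine this with one of the regex-based answers or a DNS lookup.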
Nowadays, I use the following, based on Padam's answer:
$ python --version
Python 3.6.5
And this is how it looks:
from urllib.parse import urlparse

def is_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False
Just use is_url("http://www.asdf.com").
Hope it helps!
I landed on this page trying to figure out a sane way to validate strings as "valid" URLs. I share here my solution using Python 3. No extra libraries required.
See https://docs.python.org/2/library/urlparse.html if you are using Python 2.
See https://docs.python.org/3.0/library/urllib.parse.html if you are using Python 3 as I am.
import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]
for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))
ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')
ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')
'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.
'https://stackoverflow.com' is probably a valid url.
Here is a more concise function:
from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')

def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])
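Called on the sample strings from earlier, it behaves the same way as the longer version. A quick usage sketch:

```python
from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')

def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])

print(is_valid('https://stackoverflow.com'))      # True
print(is_valid('/data/Python.html'))              # False -- no scheme or netloc
print(is_valid('dkakasdkjdjakdjadjfalskdjfalk'))  # False
```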
note - lepl is no longer supported, sorry (you're welcome to use it, and I think the code below works, but it's not going to get updates).
RFC 3696 http://www.faqs.org/rfcs/rfc3696.html defines how to do this (for HTTP URLs and email). I implemented its recommendations in Python using lepl (a parser library). See http://acooke.org/lepl/rfc3696.html
To use it:
> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True
EDIT
As pointed out by @Kwame, the code below validates the URL even if the .com or .co etc. is not present.
As also pointed out by @Blaise, a URL like https://www.google is a valid URL, and you need to do a DNS check separately to see whether it resolves or not.
This is simple and works:
So min_attr contains the basic set of strings that need to be present to define the validity of a URL, i.e. the http:// part and the google.com part.
result.scheme stores the scheme (http) and result.netloc stores the domain name (google.com).
from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        if all([result.scheme, result.netloc]):
            return True
        else:
            return False
    except ValueError:
        return False
all() returns True if all the elements inside it are truthy. So if result.scheme and result.netloc are present, i.e. have some value, then the URL is valid and the function returns True.
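For instance, the same check under the Python 3 import gives the following (a quick sketch of mine):

```python
from urllib.parse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

print(url_check('http://google.com'))  # True -- scheme and netloc both present
print(url_check('google.com'))         # False -- parsed entirely as a path
print(url_check('http://'))            # False -- netloc is empty
```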
urllib and a Django-like regex for validating URLs
The Django URL validation regex was actually pretty good, but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!
import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)"  # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))"  # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+"  # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))"  # check for top level domain, no dashes allowed
    r"|localhost)"  # accept also "localhost" only
    r"(:\d{1,5})?",  # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$",  # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url
The function validates the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into the two according parts, which are then matched with the corresponding regex terms.)
The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.:

https://www.google.com:80/search?q=python
^^^^^   ^^^^^^^^^^^^^^^^^
|       |
|       +-- netloc (aka "domain" in my code)
+-- scheme
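This split is easy to confirm with urlparse itself; if you need the bare host without the port, the parse result also exposes .hostname and .port:

```python
from urllib.parse import urlparse

result = urlparse("https://www.google.com:80/search?q=python")
print(result.scheme)    # 'https'
print(result.netloc)    # 'www.google.com:80' -- the port stays in netloc
print(result.hostname)  # 'www.google.com'    -- host only, lowercased
print(result.port)      # 80                  -- parsed out as an int
print(result.path)      # '/search'
print(result.query)     # 'q=python'
```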
IPv4 addresses are also validated.
If you want the URL validator to also work with IPv6 addresses, do the following:
Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex.
Add "and not is_valid_ipv6(domain)" to the last if.
Here are some examples of the regex for the netloc (aka domain) part in action:
All of the above solutions recognize a string like "http://www.google.com/path,www.yahoo.com/path" as valid. This solution always works as it should.
import re

# URL-link validation
ip_middle_octet = u"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = u"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
    u"^"
    # protocol identifier
    u"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
    # user:pass authentication
    u"(?:\S+(?::\S*)?@)?"
    u"(?:"
    u"(?P<private_ip>"
    # IP address exclusion
    # private & local networks
    u"(?:localhost)|"
    u"(?:(?:10|127)" + ip_middle_octet + u"{2}" + ip_last_octet + u")|"
    u"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + u")|"
    u"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + u"))"
    u"|"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    u"(?P<public_ip>"
    u"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    u"" + ip_middle_octet + u"{2}"
    u"" + ip_last_octet + u")"
    u"|"
    # host name
    u"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
    # domain name
    u"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
    # TLD identifier
    u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    u")"
    # port number
    u"(?::\d{2,5})?"
    # resource path
    u"(?:/\S*)?"
    # query string
    u"(?:\?\S*)?"
    u"$",
    re.UNICODE | re.IGNORECASE
)

def url_validate(url):
    """ URL string validation
    """
    return URL_PATTERN.match(url)
Not directly relevant, but often it's required to identify whether some token CAN be a URL or not, not necessarily 100% correctly formed (i.e., with the https part omitted and so on). I've read this post and did not find the solution, so I am posting my own here for the sake of completeness.
def get_domain_suffixes():
    import requests
    res = requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst = set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains = line.split('.')
            cand = domains[-1]
            if cand:
                lst.add('.' + cand)
    return tuple(sorted(lst))

domain_suffixes = get_domain_suffixes()

def reminds_url(txt: str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    """
    ltext = txt.lower().split('/')[0]
    return ltext.startswith(('http', 'www', 'ftp')) or ltext.endswith(domain_suffixes)
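If you can't (or don't want to) fetch the public-suffix list over the network at import time, the same heuristic works against a hard-coded subset of suffixes. A sketch — the suffix tuple below is an illustrative assumption of mine, not the full publicsuffix.org data:

```python
# Illustrative subset of public suffixes; the real list comes from
# https://publicsuffix.org/list/public_suffix_list.dat
domain_suffixes = ('.com', '.net', '.org', '.ru', '.io')

def reminds_url(txt: str):
    """Heuristic: does this token look like it could be a URL?"""
    ltext = txt.lower().split('/')[0]
    return ltext.startswith(('http', 'www', 'ftp')) or ltext.endswith(domain_suffixes)

print(reminds_url('yandex.ru.com/somepath'))  # True -- host part ends with .com
print(reminds_url('www.example.invalid'))     # True -- starts with www
print(reminds_url('just_a_word'))             # False
```

A cached copy of the real list would be the sensible middle ground for production use.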
Here's a regex solution, since the top-voted regex doesn't work for weird cases like multi-part top-level domains. Some test cases are down below.
import re

regex = re.compile(
    r"(\w+://)?"  # protocol (optional)
    r"(\w+\.)?"  # host (optional)
    r"((\w+)\.(\w+))"  # domain
    r"(\.\w+)*"  # top-level domain (optional, can have > 1)
    r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc. (optional)
)
cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com",
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co.",
    "https://google.co."
]

for c in cases:
    m = regex.match(c)
    print(c, m is not None and m.span()[1] - m.span()[0] == len(c))
A function based on Dominic Tarro's answer:
import re

def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://"  # protocol
        r"(\w+\.)?"  # host (optional)
        r"((\w+)\.(\w+))"  # domain
        r"(\.\w+)*"  # top-level domain (optional, can have > 1)
        r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc. (optional)
        , x))
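A quick check against the question's own examples (my usage sketch):

```python
import re

def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://"  # protocol
        r"(\w+\.)?"  # host (optional)
        r"((\w+)\.(\w+))"  # domain
        r"(\.\w+)*"  # top-level domain (optional, can have > 1)
        r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc. (optional)
        , x))

print(is_url('http://google.com'))  # True
print(is_url('google.com'))         # False -- the protocol is required here
print(is_url('http://google'))      # False -- a dotted domain is required
```

Unlike the optional-protocol variant above, this one rejects bare hosts, which matches the question's definition of malformed.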
Pydantic could be used to do that. I'm not very used to it, so I can't say much about its limitations. It is an option, though, and no one has suggested it yet.
I have seen that many people asked about ftp and file URLs in previous answers, so I recommend getting to know the documentation, as Pydantic has many types for validation, such as FileUrl, AnyUrl, and even database URL types.
A simplistic usage example:
from requests import get, HTTPError, ConnectionError
from pydantic import BaseModel, AnyHttpUrl, ValidationError

class MyConfModel(BaseModel):
    URI: AnyHttpUrl

try:
    myAddress = MyConfModel(URI="http://myurl.com/")
    req = get(myAddress.URI, verify=False)
    print(myAddress.URI)
except ValidationError:
    print('Invalid destination')
Pydantic also raises exceptions (pydantic.ValidationError) that can be used to handle errors.
I have tested it with these patterns:
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

url = 'http://google.com'
if is_valid_url(url):
    print('Valid URL')
else:
    print('Malformed URL')