How can I check whether a given link (URL) points to a file or to another webpage?
Currently I'm doing this with a fairly hacky, multi-step check, which also requires converting relative links to absolute ones, adding an http prefix if it's missing, and stripping '#' anchors/params before it works. I'm also not sure whether my whitelist covers every page extension that exists.
import re

def check_file(url):
    try:
        sub_domain = re.split(r'/+', url)[2]  # part after the 2nd slash(es)
    except IndexError:
        return False  # nothing there = main page, not a file
    if not re.search(r'\.', sub_domain):
        return False  # no dot, not a file
    if re.search(r'\.htm[l]?$|\.php$|\.asp$', sub_domain):
        return False  # whitelist some known page extensions
    return True
tests = [
    'https://www.stackoverflow.com',
    'https://www.stackoverflow.com/randomlink',
    'https:////www.stackoverflow.com//page.php',
    'https://www.stackoverflow.com/page.html',
    'https://www.stackoverflow.com/page.htm',
    'https://www.stackoverflow.com/file.exe',
    'https://www.stackoverflow.com/image.png'
]

for test in tests:
    print(f'{check_file(test)}: {test}')
# False: https://www.stackoverflow.com
# False: https://www.stackoverflow.com/randomlink
# False: https:////www.stackoverflow.com//page.php
# False: https://www.stackoverflow.com/page.html
# False: https://www.stackoverflow.com/page.htm
# True: https://www.stackoverflow.com/file.exe
# True: https://www.stackoverflow.com/image.png
Is there a clean, single-regex solution to this problem, or a library with an established function for it? I assume someone must have faced this problem before me, but unfortunately I couldn't find a solution here on SO or elsewhere.
urlparse is your friend.
from urllib.parse import urlparse

def check_file(url):
    path = urlparse(url).path       # extract the path component of the URL
    name = path.rsplit('/', 1)[-1]  # discard everything before the last slash
    if '.' not in name:             # if there's no dot, it's definitely not a file
        return False
    ext = path.rsplit('.', 1)[-1]   # extract the file extension
    return ext not in {'htm', 'html', 'php', 'asp'}
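A side benefit worth noting: urlparse already splits the query string and fragment into separate components, so the manual '#' stripping mentioned in the question becomes unnecessary. A small demonstration:

```python
from urllib.parse import urlparse

# The query ('?...') and fragment ('#...') are parsed out of the path,
# so they never interfere with the extension check
parts = urlparse('https://www.stackoverflow.com/image.png?s=328#top')
print(parts.path)      # /image.png
print(parts.query)     # s=328
print(parts.fragment)  # top
```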
This can be simplified further with the pathlib module:
from urllib.parse import urlparse
from pathlib import PurePath

def check_file(url):
    path = PurePath(urlparse(url).path)
    ext = path.suffix[1:]  # file extension without the leading dot
    if not ext:
        return False
    return ext not in {'htm', 'html', 'php', 'asp'}
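The question also mentions converting relative links to absolute ones by hand. The standard library covers that step too: urljoin resolves a relative link against the page it was found on, so the result can be fed straight into a check like the one above. A short sketch:

```python
from urllib.parse import urljoin

# Resolve relative links against the page they came from,
# replacing the manual relative-to-absolute conversion
base = 'https://www.stackoverflow.com/dir/page.html'
print(urljoin(base, 'file.exe'))    # https://www.stackoverflow.com/dir/file.exe
print(urljoin(base, '/image.png'))  # https://www.stackoverflow.com/image.png
```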
Aran-Fey's answer works well on well-behaved pages, which make up the vast majority of the web. But there is no rule that a URL ending in a particular extension must resolve to content of that type. A poorly configured server could return HTML for a request for "example.png", an MPEG for "example.php", or any other combination of content type and file extension.
The most accurate way to get content-type information for a URL is to actually visit it and examine the Content-Type field in its response headers. Most HTTP libraries have a way to retrieve only the headers from a site, so this operation is relatively quick even for very large pages. For example, if you were using requests, you might do:
import requests

def get_content_type(url):
    response = requests.head(url)  # HEAD request: headers only, no body
    return response.headers['Content-Type']
test_cases = [
    "http://www.example.com",
    "https://i.stack.imgur.com/T3HH6.png?s=328&g=1",
    "http://php.net/manual/en/security.hiding.php",
]

for url in test_cases:
    print("Url:", url)
    print("Content type:", get_content_type(url))
Result:
Url: http://www.example.com
Content type: text/html; charset=UTF-8
Url: https://i.stack.imgur.com/T3HH6.png?s=328&g=1
Content type: image/png
Url: http://php.net/manual/en/security.hiding.php
Content type: text/html; charset=utf-8
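If a network round trip per link is too expensive, the standard library's mimetypes module offers a middle ground: it guesses the content type from the extension alone. This is a sketch, and it only inspects the URL string, so it inherits the same caveat about misbehaving servers; it returns None for unknown extensions rather than deciding either way.

```python
import mimetypes
from urllib.parse import urlparse

def guess_content_type(url):
    # Guess from the path's extension alone -- no request is made.
    # Parse out the path first so query strings don't confuse the guess.
    path = urlparse(url).path
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type  # None when the extension is unknown

print(guess_content_type('https://i.stack.imgur.com/T3HH6.png?s=328&g=1'))  # image/png
print(guess_content_type('http://www.example.com/'))                        # None
```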