
Regex check if link is to a file

How can I check whether a given link (URL) points to a file or to another webpage?

I mean:

Currently I am doing it with a rather hacky, multi-step check. To work, it also requires converting relative links to absolute ones, adding the http prefix if it is missing, and removing '#' anchors and parameters. I am also not sure whether I'm whitelisting all the page extensions that exist.

import re

def check_file(url):
    try:
        sub_domain = re.split(r'/+', url)[2]  # part after the second run of slashes
    except IndexError:
        return False  # nothing after the host = main page, no file
    if not re.search(r'\.', sub_domain):
        return False  # no dot, no file
    if re.search(r'\.html?$|\.php$|\.asp$', sub_domain):
        return False  # whitelist some page extensions
    return True

tests = [
    'https://www.stackoverflow.com',
    'https://www.stackoverflow.com/randomlink',
    'https:////www.stackoverflow.com//page.php',
    'https://www.stackoverflow.com/page.html',
    'https://www.stackoverflow.com/page.htm',
    'https://www.stackoverflow.com/file.exe',
    'https://www.stackoverflow.com/image.png'
]

for test in tests:
    print(f'{check_file(test)}: {test}')
# False: https://www.stackoverflow.com
# False: https://www.stackoverflow.com/randomlink
# False: https:////www.stackoverflow.com//page.php
# False: https://www.stackoverflow.com/page.html
# False: https://www.stackoverflow.com/page.htm
# True: https://www.stackoverflow.com/file.exe
# True: https://www.stackoverflow.com/image.png

Is there a clean, single-regex solution to this problem, or a library with an established function for it? I guess someone must have faced this problem before me, but unfortunately I couldn't find a solution here on SO or elsewhere.

urlparse is your friend.

from urllib.parse import urlparse

def check_file(url):
    path = urlparse(url).path  # extract the path component of the URL
    name = path.rsplit('/', 1)[-1]  # discard everything before the last slash

    if '.' not in name:  # if there's no . it's definitely not a file
        return False

    ext = path.rsplit('.', 1)[-1]  # extract the file extension
    return ext not in {'htm', 'html', 'php', 'asp'}

This can be simplified further with the use of the pathlib module:

from urllib.parse import urlparse
from pathlib import PurePosixPath

def check_file(url):
    # URL paths always use forward slashes, so use the POSIX flavour on every OS
    path = PurePosixPath(urlparse(url).path)
    ext = path.suffix[1:]

    if not ext:
        return False

    return ext not in {'htm', 'html', 'php', 'asp'}
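
The question also mentions relative links, '#' anchors and parameters. As a rough sketch only (the name `looks_like_file` and the `PAGE_EXTENSIONS` set are made up for illustration, and the extension list is an assumption that will not cover every page type), the same idea handles those cases without any pre-processing, because urlparse already splits the query string and fragment away from the path. The comparison is also made case-insensitive here:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed list of extensions that usually denote web pages; extend as needed.
PAGE_EXTENSIONS = {'htm', 'html', 'php', 'asp'}

def looks_like_file(url):
    # urlparse keeps the query string and '#' fragment out of .path
    path = PurePosixPath(urlparse(url).path)
    # suffix is '' when the last path segment has no dot
    ext = path.suffix[1:].lower()
    return bool(ext) and ext not in PAGE_EXTENSIONS
```

Links without a scheme are still ambiguous with this approach: urlparse treats 'www.example.com' as a path, so '.com' would be read as a file extension. Such links still need the prefix added first.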

Aran-Fey's answer works well on well-behaved pages, which make up 99.99% of the web. But there's no rule that says a URL ending with a particular extension must resolve to content of a particular type. A poorly configured server could return HTML for a request to a page named "example.png", or it could return an MPEG for a page named "example.php", or any other combination of content types and file extensions.

The most accurate way to get content type information for a URL is to actually visit that URL and examine the Content-Type field in the response headers. Most HTTP libraries have a way to retrieve only the header information from a site, so this operation should be relatively quick even for very large pages. For example, if you were using requests, you might do:

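import requests

def get_content_type(url):
    # HEAD fetches only the headers; follow redirects so the final resource is inspected
    response = requests.head(url, allow_redirects=True, timeout=10)
    # .get() returns None instead of raising KeyError when the header is absent
    return response.headers.get('Content-Type')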

test_cases = [
    "http://www.example.com",
    "https://i.stack.imgur.com/T3HH6.png?s=328&g=1",
    "http://php.net/manual/en/security.hiding.php",
]

for url in test_cases:
    print("Url:", url)
    print("Content type:", get_content_type(url))

Result:

Url: http://www.example.com
Content type: text/html; charset=UTF-8
Url: https://i.stack.imgur.com/T3HH6.png?s=328&g=1
Content type: image/png
Url: http://php.net/manual/en/security.hiding.php
Content type: text/html; charset=utf-8
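
To turn a header value like the ones above into the file-or-page answer the question asks for, something along these lines might do. It is only a sketch: `is_page_content_type` and `PAGE_TYPES` are hypothetical names, and which media types count as a "page" is an assumption you would adjust for your own use.

```python
# Assumed set of media types treated as "web pages"; adjust to taste.
PAGE_TYPES = {'text/html', 'application/xhtml+xml'}

def is_page_content_type(content_type):
    # Drop parameters such as "; charset=UTF-8" and normalise case
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type in PAGE_TYPES
```

With the results shown above, 'text/html; charset=UTF-8' counts as a page and 'image/png' as a file.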

