
Regex for Replacing Absolute URLs with Relative URLs

How can I write a regular expression that converts absolute URLs to relative paths? For example:

src="http://www.test.localhost/sites/"

would become

src="/sites/"

The domains are not static.

I can't use parse_url (as per this answer) because the URLs are part of a larger string that also contains non-URL data.

Solution

You can use the following regex:

/https?:\/{2}[^\/]+/

It matches the scheme and host portion of URLs such as:

http://www.test.localhost/sites/
http://www.domain.localhost/sites/
http://domain.localhost/sites/

So it would be:

$domain = preg_replace('/https?:\/{2}[^\/]+/', '', $domain);

Explanation

http: Match the literal 'http'
s?: Optionally match an 's' after 'http', so 'https' is covered too
: : Match the ':' character
\/{2}: Match the two slashes '//'
[^\/]+: Match one or more characters that are not a slash (/), i.e. the host
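
To see how the pieces explained above behave inside a larger string, here is a minimal sketch. The input strings and the stricter variant are assumptions made for illustration, not part of the original answer. It also shows one caveat: when a URL has no path, [^\/]+ keeps consuming characters until the next slash anywhere in the string.

```php
<?php
// Hypothetical input: markup mixed with non-URL text.
$html = '<p>Logo: <img src="http://www.test.localhost/sites/logo.png"> '
      . '<a href="https://domain.localhost/about?x=1">About</a></p>';

// Pattern from the answer: strip the scheme and host, keep the rest.
$relative = preg_replace('/https?:\/{2}[^\/]+/', '', $html);
echo $relative, PHP_EOL;
// <p>Logo: <img src="/sites/logo.png"> <a href="/about?x=1">About</a></p>

// Caveat: a URL without a path lets [^\/]+ run on to the next slash.
$noPath = '<img src="http://example.com"> and <a href="/x">x</a>';
$greedy = preg_replace('/https?:\/{2}[^\/]+/', '', $noPath);
echo $greedy, PHP_EOL;
// <img src="/x">x</a>

// Assumed stricter variant: the host also ends at a quote or whitespace.
$strict = preg_replace('~https?://[^/"\'\s]+~', '', $noPath);
echo $strict, PHP_EOL;
// <img src=""> and <a href="/x">x</a>
```

Note that both patterns rewrite every http(s) URL in the string, not only those in src attributes, and that the stricter variant leaves an empty attribute value when the URL has no path.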

My guess is that this expression, or an improved version of it, might work to some extent:

^\s*src=["']\s*https?:\/\/(?:[^\/]+)([^"']+?)\s*["']$

The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.


Test

$re = '/^\s*src=["\']\s*https?:\/\/(?:[^\/]+)([^"\']+?)\s*["\']$/m';
$str = 'src=" http://www.test.localhost/sites/  "
src=" https://www.test.localhost/sites/"
src=" http://test.localhost/sites/   "
  src="https://test.localhost/sites/   "
      src="https://localhost/sites/   "
src=\'https://localhost/   \'
src=\'http://www.test1.test2.test3.test4localhost/sites1/sites2/sites3/   \'';
$subst = 'src="$1"';

var_export(preg_replace($re, $subst, $str));

Output

src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/"
src="/sites1/sites2/sites3/"
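
The pattern above is anchored with ^ and $, so it only matches when the src attribute sits alone on a line. Since the question mentions a larger string, here is a sketch of an unanchored variant. The modified pattern and the sample input are assumptions for illustration; it uses a backreference so the closing quote must match the opening one.

```php
<?php
// Assumed variant: no ^/$ anchors; \1 requires the same quote on both sides.
$re = '/\bsrc=(["\'])\s*https?:\/\/[^\/"\']+([^"\']*?)\s*\1/i';

// Hypothetical input with src attributes embedded in other markup.
$str = '<p>Intro</p><img src="http://www.test.localhost/sites/a.png" alt="a">'
     . "<script src='https://localhost/js/app.js  '></script>";

// $1 is the quote character, $2 is the path after the host.
$out = preg_replace($re, 'src=$1$2$1', $str);
echo $out, PHP_EOL;
// <p>Intro</p><img src="/sites/a.png" alt="a"><script src='/js/app.js'></script>
```

Only src attributes are touched here; other attributes such as href keep their absolute URLs. A URL without a path still ends up as an empty attribute value.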


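$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xp = new DOMXPath($dom);

// Visit every src attribute in the document.
foreach ($xp->query('//@src') as $attr) {
    $url = parse_url($attr->nodeValue);

    // Skip values without a scheme, or whose scheme does not start with 'http'.
    if ( !isset($url['scheme']) || stripos($url['scheme'], 'http') !== 0 )
        continue;

    // Rebuild the URL without scheme and host; fall back to '/' when there is no path.
    $src = ( $url['path'] ?? '/' )
         . ( isset($url['query']) ? '?' . $url['query'] : '' )
         . ( isset($url['fragment']) ? '#' . $url['fragment'] : '' );

    // ownerElement is the element that carries this attribute.
    $attr->ownerElement->setAttribute('src', $src);
}

$result = $dom->saveHTML();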

I added an if condition to skip values where it is impossible to tell whether the src attribute starts with a domain or with a path. Depending on what you are trying to do, you can remove this test.
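
The scheme check described above can be isolated in a small helper, which makes its effect on individual values easier to see. This is only a sketch: the function name toRelativeUrl is hypothetical, and unlike the loop above it accepts exactly 'http' and 'https' rather than any scheme starting with 'http'.

```php
<?php
// Hypothetical helper: returns the path, query and fragment of an absolute
// http(s) URL, or the input unchanged when it has no http(s) scheme.
function toRelativeUrl(string $url): string
{
    $parts = parse_url($url);

    // parse_url returns false for seriously malformed URLs.
    if ($parts === false || !isset($parts['scheme'])
        || !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return $url;
    }

    // Fall back to '/' when the URL has no path component.
    return ($parts['path'] ?? '/')
         . (isset($parts['query']) ? '?' . $parts['query'] : '')
         . (isset($parts['fragment']) ? '#' . $parts['fragment'] : '');
}

echo toRelativeUrl('http://www.test.localhost/sites/'), PHP_EOL; // /sites/
echo toRelativeUrl('https://example.com'), PHP_EOL;              // /
echo toRelativeUrl('/img/a.png'), PHP_EOL;                       // /img/a.png
```

Values that are already relative pass through untouched, which is the case the if condition is meant to protect.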

If you are working with a fragment of an HTML document (i.e. not a full document), replace $result = $dom->saveHTML() with something like:

$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $childNode) {
    $result .= $dom->saveHTML($childNode);
}
