
Convert given relative urls to absolute urls

I need to convert a few relative URLs in a given HTML text to absolute URLs.

The HTML text contains a mix of relative and absolute URLs, and the resulting HTML should contain only absolute URLs, according to the following rules.

  1. The original HTML text contains a mix of relative and absolute URLs.
  2. /test/1.html needs to be converted into https://www.example.com/test/1.html
  3. Existing absolute URLs (both .com and .de) must be left untouched, such as http://www.example.com/test/xxx.html , https://www.example.com/test/xxx.html , https://www.example.de/test/xxx.html , http://www.example.de/test/xxx.html

I believe the best way to do this is with preg_replace, since I am using PHP, and I tried the following code.

$server_url = "https://www.example.com";
$html = preg_replace('@(?<!https://www\.example\.com)(?<!http://www\.example\.com)(?<!https://www\.example\.de)(?<!http://www\.example\.de)/test@iU', $server_url.'/test', $html);

However, this doesn't give the desired result. Instead, it converted all the /test links, including the existing absolute URLs, so some URLs ended up like http://www.example.dehttp://www.example.com/test/xxx.html .

I'm not good at regex; please help me find a proper regex to get the desired result.

This should match root-relative URLs:

^(\/[^\/]{1}.*\.html)$

The matched URL is captured in group 1 ($1).

https://regex101.com/r/E1evez/2


<?php
$urls = [
    '/test/1.html',
    'http://www.example.com/test/xxx.html',
    'https://www.example.de/test/xxx.html',
    '/relative/path/file.html'
];

foreach( $urls as $url )
{
    if( preg_match( '/^(\/[^\/]{1}.*\.html)$/', $url ) )
    {
        echo 'match: '.$url.PHP_EOL;
    }
    else
    {
        echo 'no match: '.$url.PHP_EOL;
    }
}

Outputs:

match: /test/1.html
no match: http://www.example.com/test/xxx.html
no match: https://www.example.de/test/xxx.html
match: /relative/path/file.html
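
The pattern above tests one URL string at a time. To apply the same idea to a whole HTML string, one option is to look only at attribute values and prefix those that start with a single slash. The following is a minimal sketch, not a full HTML parser: the helper name absolutizeUrls, the restriction to href and src attributes, and the sample markup are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: prefixes root-relative href/src values with $serverUrl.
function absolutizeUrls(string $html, string $serverUrl): string
{
    // Group 1: the attribute opener, e.g. href="
    // Group 2: a value starting with a single slash; (?!/) skips
    //          protocol-relative URLs such as //cdn.example.com/x.js
    $pattern = '@\b((?:href|src)\s*=\s*["\'])(/(?!/)[^"\']*)@i';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($serverUrl): string {
            // Re-emit the opener, then the server URL, then the original path.
            return $m[1] . $serverUrl . $m[2];
        },
        $html
    );
}

// Sample input (assumed): one root-relative link and one absolute link.
$html = '<a href="/test/1.html">one</a> '
      . '<a href="https://www.example.de/test/xxx.html">two</a>';

echo absolutizeUrls($html, 'https://www.example.com'), PHP_EOL;
```

Because an absolute URL never starts with a slash right after the opening quote, it is not matched and stays as it was. A callback is used instead of a plain replacement string so that any $ or backslash in the server URL is not interpreted as a backreference.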

If all the relative URLs start with a forward slash, you could use:

(?<!\S)(?:/[^/\s]+)+/\S+\.html\S*

Explanation

  • (?<!\S) Assert that what is directly to the left is not a non-whitespace char
  • (?:/[^/\s]+)+ Repeat 1+ times matching / , then any char except / or a whitespace char, using a negated character class
  • /\S+ Match / and 1+ non-whitespace chars
  • \.html\S* Match .html as in the example data, followed by 0+ non-whitespace chars

Regex demo
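
As a rough sketch of how this pattern could be used with preg_replace, assuming the URLs are separated by whitespace (the sample text and variable names below are made up for illustration):

```php
<?php
$serverUrl = 'https://www.example.com';

// Whitespace-separated sample input (an assumption; not real HTML markup).
$text = "/test/1.html http://www.example.de/test/xxx.html /relative/path/file.html";

// '~' is the delimiter, so the slashes in the pattern need no escaping.
// $0 in the replacement refers to the whole match.
$result = preg_replace(
    '~(?<!\S)(?:/[^/\s]+)+/\S+\.html\S*~',
    $serverUrl . '$0',
    $text
);

echo $result, PHP_EOL;
```

Only the two URLs that begin with a slash get the prefix. Note that the lookbehind requires whitespace or the start of the string before the URL, so a URL sitting directly after a quote, as in href="/test/1.html", would not be matched by this pattern.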

If you also want to match /1.html you could change the quantifier of the first group from )+ to )*

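To match more extensions than .html you could list the ones you want to allow, like \.(?:html|jpg|png), or use a character class such as \.[\w()-]+ and add whatever characters you want to allow. Keep the hyphen at the end of the class (or escape it), because PHP 7.3+ rejects [\w-()] as an invalid range.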
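
The two variations above can be combined. The following sketch uses the )* quantifier together with an extension alternation and collects the matches with preg_match_all; the sample text is an assumption.

```php
<?php
// Sample input (assumed): whitespace-separated URLs with mixed extensions.
$text = "/1.html /img/logo.png https://www.example.com/a.jpg /docs/readme.txt";

// )* lets a single-segment path such as /1.html match;
// the alternation limits matches to the listed extensions.
$pattern = '~(?<!\S)(?:/[^/\s]+)*/\S+\.(?:html|jpg|png)\S*~';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full matches.
print_r($matches[0]);
```

Here /1.html and /img/logo.png are matched, the absolute URL is skipped because it does not start with a slash, and /docs/readme.txt is skipped because .txt is not in the list.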

