
regex for all urls in string not capturing urls with question mark

I start with this PHP string.

$bodyString = '
    another 1 body
    reg http://www.regularurl.com/home
    secure https://facebook.com/anothergreat.
    a subdomain http://info.craig.org/
    dynamic; http://www.spring1.com/link.asp?id=100408
    www domain; at www.wideweb.com
    single no subdomain; simple.com';

I need to turn all domains and URLs into anchor (<a>) elements.

$bodyString = preg_replace('#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#si', '<a href="$0">$0</a>', $bodyString);

$bodyString result:

'another 1 body
    reg <a href="http://www.regularurl.com/home">http://www.regularurl.com/home</a>
    secure <a href="https://facebook.com/anothergreat.">https://facebook.com/anothergreat.</a>
    a subdomain <a href="http://info.craig.org/">http://info.craig.org/</a>
    dynamic; <a href="http://www.spring1.com/link.asp">http://www.spring1.com/link.asp</a>?id=100408
    www domain; at <a href="www.wideweb.com">www.wideweb.com</a>
    single no subdomain; <a href="simple.com">simple.com</a>';

Result: every URL and domain is turned into an <a> element, except that the query string of http://www.spring1.com/link.asp?id=100408 (the ?id=100408 part) is left outside the link.

What is missing in the regex to make this work?

$bodyString = '
    another 1 body
    reg http://www.regularurl.com/home
    secure https://facebook.com/anothergreat.
    a subdomain http://info.craig.org/
    dynamic; http://www.spring1.com/link.asp?id=100408
    www domain; at www.wideweb.com
    single no subdomain; simple.com';

$regex = '@(http)?(s)?(://)?(([a-zA-Z])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])@'; 
$converted_string = preg_replace($regex, '<a href="$0">$0</a>', $bodyString);
echo $converted_string;


Building on @WiktorStribiżew's comment, you could try this:

[^\s]{2,256}\.[a-z]{2,4}\b(?:[?/][^\s]*)*


Note: although there are already two answers, this one is more concise because it uses [^\s].

Explanation:

[^\s]{2,256} matches 2 to 256 non-whitespace characters, i.e. the https://facebook or https://www.randomdomain part,
\. matches the dot after that,
[a-z]{2,4} is the domain extension, e.g. com, in, etc.,
\b is a word boundary,
(?:[?/][^\s]*)* is a non-capturing group that matches a slash / or a question mark ? followed by the rest of the URL; the group can repeat zero or more times and covers the path and query string.
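
As a rough sketch of how this pattern could be plugged into the question's code: the helper name linkify is hypothetical (it does not appear in the question or the answers), and the ~ delimiter is simply a choice that avoids clashing with characters in the pattern.

```php
<?php
// Hypothetical helper (not from the original posts): wraps every match in an <a> element.
function linkify(string $text)
{
    // The answer's pattern, delimited with ~ because the pattern contains no tilde.
    $pattern = '~[^\s]{2,256}\.[a-z]{2,4}\b(?:[?/][^\s]*)*~';
    return preg_replace($pattern, '<a href="$0">$0</a>', $text);
}

// Same sample text as in the question.
$bodyString = '
    another 1 body
    reg http://www.regularurl.com/home
    secure https://facebook.com/anothergreat.
    a subdomain http://info.craig.org/
    dynamic; http://www.spring1.com/link.asp?id=100408
    www domain; at www.wideweb.com
    single no subdomain; simple.com';

// The "dynamic" line now keeps ?id=100408 inside the link.
echo linkify($bodyString);
```

Two things to keep in mind with this sketch: the trailing period after anothergreat still ends up inside the link, and matches without a scheme (www.wideweb.com, simple.com) get an href without http://, which browsers treat as a relative link.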

To gain a better understanding of Regex Syntax, you should try this website: rexegg.com

[-\w@:%+.\~#?&/=]{2,256}\.[a-z]{2,4}\b[^\s]*

[^\s]* appends every non-whitespace character that follows to the URL. The match stops at the first whitespace character, which is not part of the URL. Simple and easy.
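
To see why the tail matters, here is a small comparison on the one URL that failed, assuming the extension part is read as [a-z]. The helper firstMatch and the variable names are made up for illustration. The question's pattern only allows a tail that starts with /, so it stops before the ?; the pattern above accepts any run of non-whitespace after the extension.

```php
<?php
// Hypothetical helper: returns the first match of $pattern in $subject, or null.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$url = 'http://www.spring1.com/link.asp?id=100408';

// Pattern from the question: after the extension, only a tail beginning with "/" is allowed.
$questionPattern = '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#si';

// Pattern from this answer: after the extension, any non-whitespace characters are allowed.
$answerPattern = '![-\w@:%+.\~#?&/=]{2,256}\.[a-z]{2,4}\b[^\s]*!';

echo firstMatch($questionPattern, $url), "\n"; // http://www.spring1.com/link.asp
echo firstMatch($answerPattern, $url), "\n";   // http://www.spring1.com/link.asp?id=100408
```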

