Match link patterns in HTML code with a RegEx

Question

I'm using a linkify function that detects link-like patterns with a regex and wraps them in <a> tags to produce clickable links.

The regex looks like this:

    // http://, https://, ftp:// 
    var urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;
    /* Some explanations:
    (?!     # Negative lookahead start (will cause match to fail if contents match)
    [^<]*   # Any number of non-'<' characters
    >       # A > character
    |       # Or
    [^<>]*  # Any number of non-'<' and non-'>' characters
    </      # The characters < and /
    )       # End negative lookahead.
    */

and replaces the link like this:

    return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>');

The regex works perfectly for links in plain text. However, I also use it on HTML code, such as:

    <ul><li>Link: https://www.link.com</li></ul>        // linkify not working
    <ul><li>Link: https://www.link.com <br/></li></ul>  // linkify working

where only the second example works. I don't know why the behavior differs and would be very glad to get some help. What should my regex look like so that it linkifies list elements without the break?

Answer 1

Ciao,

If I understood your issue correctly, I think this regex should detect the links in both scenarios:

\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)

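Essentially, the first part, \b(?![^<]*>), rejects any candidate that is followed by a > before the next <, that is, anything sitting inside a tag. Compared with your original, the second lookahead alternative, [^<>]*<\/, is gone: it rejected every URL that is followed directly by a closing tag such as </li>.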

Then we grab the parts of interest. The protocol sits in a non-capturing group, as in your original expression, so it can be stripped later if it is not needed. The last part, a capturing group, takes the rest of the URL.
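A quick way to compare the two patterns is to run both against the two examples from the question. This is only a sketch: the variable names are made up for illustration, and the patterns are copied from the question and from the regex above.

```javascript
// Pattern from the question: the lookahead has two alternatives.
const originalPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;
// Pattern from this answer: only the "inside a tag" alternative is kept.
const revisedPattern = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;

const withoutBreak = '<ul><li>Link: https://www.link.com</li></ul>';
const withBreak = '<ul><li>Link: https://www.link.com <br/></li></ul>';

// String.prototype.match with the g flag returns all matches, or null.
console.log(withoutBreak.match(originalPattern)); // null
console.log(withBreak.match(originalPattern));    // [ 'https://www.link.com' ]
console.log(withoutBreak.match(revisedPattern));  // [ 'https://www.link.com' ]
console.log(withBreak.match(revisedPattern));     // [ 'https://www.link.com' ]
```

In the second string the next tag after the URL is <br/>, which does not start with </, so the original lookahead lets the URL through there.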

Because of the way the regex is built, we can now choose between the entire URL (the full match) and just the part after the protocol (capture group 1).

To log the two parts, we can use this snippet:

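const str = '<ul><li>Link: https://www.link.com</li></ul>';
const myRegexp = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
// exec returns the first match: index 0 is the full match, index 1 is capture group 1.
const match = myRegexp.exec(str);
console.log(match[0]); // full URL: https://www.link.com
console.log(match[1]); // URL without the protocol: www.link.com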
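Plugged back into the replace call from the question, the same regex could be used roughly as follows. This is a minimal sketch: the function name linkify and its signature are assumptions, since the question does not show the full function.

```javascript
// Hypothetical linkify helper; the name and signature are assumptions.
function linkify(textInput) {
  const urlPattern = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
  // $& inserts the whole match, as in the question's replace call.
  return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>');
}

console.log(linkify('<ul><li>Link: https://www.link.com</li></ul>'));
// <ul><li>Link: <a target="_blank" rel="noopener" href="https://www.link.com">https://www.link.com</a></li></ul>
```

One trade-off to be aware of: without the second lookahead alternative, a URL that is already the text of an existing <a>...</a> element would be wrapped a second time, which is presumably what that alternative was guarding against.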

Possible variations:

In a situation like the one above, where no URL appears inside a tag attribute, you can simplify the regex further to:
(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)

and get the same output.

If the full URL is enough, you can remove the parentheses of the second group:
(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*
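The simplified variants drop the lookahead, so they no longer skip URLs that sit inside a tag. The sketch below shows the difference, with both patterns written as regex literals; the input string and the variable names are made up for illustration.

```javascript
const simplified = /(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
const withLookahead = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;

// Made-up input where the only URL sits inside an attribute.
const anchor = '<a href="https://www.link.com">my site</a>';

console.log(anchor.match(simplified));    // [ 'https://www.link.com' ]
console.log(anchor.match(withLookahead)); // null
```

If the input can contain existing links or other attributes holding URLs, the version with the lookahead is likely the safer choice for linkifying.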

Have a good day,
Antonino

PS - I'm assuming that your examples were meant to be:

<ul><li>Link: https://www.link.com</li></ul>
<ul><li>Link: https://www.link.com <br/></li></ul>

i.e. with https, http or ftp, which makes the second case work with your original regex.
