
Match link patterns in HTML code with a RegEx

I'm using a linkify function that detects link-like patterns with a regex and replaces them with a-tags, turning them into clickable links.

The regex looks like this:

    // http://, https://, ftp:// 
    var urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;
    /* Some explanations:
    (?!     # Negative lookahead start (will cause match to fail if contents match)
    [^<]*   # Any number of non-'<' characters
    >       # A > character
    |       # Or
    [^<>]*  # Any number of non-'<' and non-'>' characters
    </      # The characters < and /
     )      # End negative lookahead.
    */
    

and replaces the link like this:

 return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>')
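For reference, here is a minimal runnable sketch of the function as described. The function name `linkify` and the sample sentence are assumptions made for illustration; the pattern and the replacement string are the ones shown above.

```javascript
// Pattern copied from above: http://, https://, ftp://
const urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;

// Hypothetical wrapper around the replace call shown above.
function linkify(textInput) {
  // "$&" in the replacement string stands for the whole matched URL.
  return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>');
}

// Assumed sample input: a URL in plain text, with no HTML around it.
const plain = linkify('See https://www.link.com for details.');
console.log(plain);
// See <a target="_blank" rel="noopener" href="https://www.link.com">https://www.link.com</a> for details.
```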

The regex works perfectly for in-text links. However, I am also using it on HTML code, such as

<ul><li>Link: https://www.link.com</li></ul> //linkify not working
<ul><li>Link: https://www.link.com <br/></li></ul> //linkify working

where only the second example works. I don't know why the behavior is different and would be very glad to get some help. What should my regex look like so that links inside list elements are linkified without the extra break?

Ciao,

If I understood your issue correctly, I think this regex should detect the links in both scenarios:

\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)

Essentially, the first part segments the input like this:

[Image: regex segmentation]

Then we grab the parts of interest: the protocol stays in a non-capturing group, as in your original expression, so that it can be stripped later if it is not needed. The capturing group at the end takes the rest of the URL.

Because of the way the regex is built, we can now choose between the entire URL (the full match) and just the part after the protocol (group 1). This is visible at the bottom right of this screenshot:

[Screenshot: regex processing]

To log the two parts, we can use this snippet:

const str = '<ul><li>Link: https://www.link.com</li></ul>';
const myRegexp = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
const match = myRegexp.exec(str);
console.log(match[0]); // full match: https://www.link.com
console.log(match[1]); // group 1, without the protocol: www.link.com
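Plugged back into the replace call from the question, the same regex can be checked against both list examples. This is only a sketch: the function name `linkifyRelaxed` is an assumption, while the pattern and the replacement string come from the post.

```javascript
// Regex from the answer: only the "inside a tag" lookahead is kept.
const relaxedPattern = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;

// Hypothetical helper that reuses the replacement string from the question.
function linkifyRelaxed(html) {
  return html.replace(relaxedPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>');
}

const first = linkifyRelaxed('<ul><li>Link: https://www.link.com</li></ul>');
const second = linkifyRelaxed('<ul><li>Link: https://www.link.com <br/></li></ul>');

console.log(first);
// <ul><li>Link: <a target="_blank" rel="noopener" href="https://www.link.com">https://www.link.com</a></li></ul>
console.log(second);
// <ul><li>Link: <a target="_blank" rel="noopener" href="https://www.link.com">https://www.link.com</a> <br/></li></ul>
```

Two trade-offs are worth keeping in mind. Without the `[^<>]*<\/` alternative, a URL that is already the text of an a-tag would be wrapped a second time, so this version is best run only on input that has not been linkified yet. And since the final character class of the original pattern is gone, trailing punctuation such as a full stop becomes part of the match.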

Possible variations:

  • in a situation like the one above, you can simplify the regex further to:

    (?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)

    which gives the same output

  • if the full URL is enough, you can remove the parentheses of the second group:

    (?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*
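The variations above drop the `(?![^<]*>)` lookahead, which makes no difference for the list example but does matter once a URL sits inside a tag attribute. The sketch below illustrates this with an assumed sample string (the `img` tag is made up for the demonstration):

```javascript
// Full-URL variant with and without the "inside a tag" lookahead.
const withLookahead = /\b(?![^<]*>)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*/gim;
const withoutLookahead = /(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*/gim;

// Hypothetical input: one URL inside an attribute, one in plain text.
const sample = '<img src="https://www.link.com/a.png"> see https://www.link.com';

const kept = sample.match(withLookahead);
const all = sample.match(withoutLookahead);

console.log(kept); // [ 'https://www.link.com' ]
console.log(all);  // [ 'https://www.link.com/a.png', 'https://www.link.com' ]
```

If the input can contain attributes with URLs, keeping the lookahead is the safer choice, because replacing the attribute value with an a-tag would break the markup.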

Have a good day,
Antonino

PS - I'm assuming that your examples were meant to be:

<ul><li>Link: https://www.link.com</li></ul>
<ul><li>Link: https://www.link.com <br/></li></ul>

i.e. with https, http or ftp, which makes the second case work with your original regex
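To see why only the second case works with the original regex, the two alternatives of its negative lookahead can be tested separately on the text that starts at the URL. This is a small diagnostic sketch; the variable names are assumptions, and the `^` anchor stands in for the position at which the lookahead is evaluated.

```javascript
// First alternative: a ">" appears before any "<", i.e. we are inside a tag.
const insideTag = /^[^<]*>/;
// Second alternative: the next tag after the URL is a closing tag.
const beforeClosingTag = /^[^<>]*<\//;

// Text starting at the URL in each example.
const restFirst = 'https://www.link.com</li></ul>';
const restSecond = 'https://www.link.com <br/></li></ul>';

console.log(insideTag.test(restFirst), beforeClosingTag.test(restFirst));   // false true
console.log(insideTag.test(restSecond), beforeClosingTag.test(restSecond)); // false false
```

In the first example the URL is directly followed by `</li>`, so the second alternative matches and the negative lookahead rejects the URL. In the second example the next tag is `<br/>`, neither alternative matches, and the URL is accepted.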
