Regex to match URL / URI except when contained in an img tag

Credit to dfowler's excellent Jabbr project: I am borrowing code from it to embed linked content from user posts. The code is from here and uses a regex to extract URLs for additional processing and embedding.

In my case, I run the user posts through a markdown processor first, before attempting this embed. The markdown processor (MarkdownDeep) will, if the user formats the markdown correctly, transform any image markdown into a valid HTML img tag. That works great. However, using the embedded content providers makes the image appear twice: it is rendered once by the markdown transform and then gets embedded again afterwards.

So I believe the solution to my problem lies in changing the regex so that it does not match when the found URL is already contained within a valid img tag.

For ease of answering, the regex so far is:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))

I think I want to use a negative look-ahead, like in this answer, to exclude the img, but I'm too poor at regex syntax to implement it myself.

NOTE: I want it to still match images if they just appear in the text. So http://www.example.com/sites/default/files/DellComputer.jpg would match, and the URL in a hyperlink such as <a href='http://www.example.com/sites/default/files/DellComputer.jpg'> would match, but <img src='http://www.example.com/sites/default/files/DellComputer.jpg'> would not.

Thanks for the help. I know some of you have savant-level regex talents; I just never could do them.

For the simple approach, just prepend

(?<!img.*)

to the beginning of your regex. It will match as it already does, but will reject a URL if img appears anywhere before it on the same line. So, the entire regex:

(?<!img.*)(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))

Again, nothing is changed except for a few characters at the beginning.
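
To make the effect of that prefix concrete, here is a minimal Python sketch under a few assumptions. Python's re module does not accept a variable-width lookbehind such as (?<!img.*), whereas the .NET engine does, so the check is emulated by inspecting the text between the start of the line and the match. URL_RE is a deliberately simplified stand-in for the full pattern, and the helper name urls_not_after_img is hypothetical. The comparison is case-sensitive on the assumption that, because the prefix sits before (?i), the .NET engine would treat img case-sensitively as well.

```python
import re

# Simplified stand-in for the full URL pattern (an assumption, not the original regex).
URL_RE = re.compile(r"\bhttps?://[^\s<>'\"]+", re.IGNORECASE)


def urls_not_after_img(text):
    """Return URLs that have no literal 'img' earlier on the same line."""
    found = []
    for match in URL_RE.finditer(text):
        # Emulate (?<!img.*): '.' does not cross newlines, so only look back
        # as far as the start of the current line.
        line_start = text.rfind("\n", 0, match.start()) + 1
        if "img" not in text[line_start:match.start()]:
            found.append(match.group(0))
    return found


plain = "see http://www.example.com/sites/default/files/DellComputer.jpg"
image = "<img src='http://www.example.com/sites/default/files/DellComputer.jpg'>"
mixed = "<img src='http://www.example.com/a.jpg'> and http://www.example.com/b"

print(urls_not_after_img(plain))  # the plain-text URL is kept
print(urls_not_after_img(image))  # the URL inside the img tag is rejected
print(urls_not_after_img(mixed))  # the second URL is rejected too: img precedes it on the line
```

The last case shows the limitation of the simple approach: a URL that merely follows an img tag on the same line is dropped as well, even though it is not inside the tag.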

If you need it to be smarter about where the img is located relative to the URL on the line, I would probably recommend using a tool other than regex.
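
One possible way to be smarter about it, sketched in Python under stated assumptions: first locate the img tags, then discard any URL match that starts inside one of them. This is not the approach from the answer above. IMG_TAG_RE, URL_RE, and urls_outside_img_tags are hypothetical names, the URL pattern is again a simplified stand-in for the full one, and finding tags with a regex is itself a simplification that a real HTML parser would likely handle more robustly.

```python
import re

# Both patterns are illustrative assumptions, not taken from the original code.
IMG_TAG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
URL_RE = re.compile(r"\bhttps?://[^\s<>'\"]+", re.IGNORECASE)


def urls_outside_img_tags(html):
    """Return URLs whose match does not start inside an <img ...> tag."""
    # Character ranges (start, end) covered by each img tag.
    img_spans = [m.span() for m in IMG_TAG_RE.finditer(html)]
    found = []
    for match in URL_RE.finditer(html):
        inside_img = any(start <= match.start() < end for start, end in img_spans)
        if not inside_img:
            found.append(match.group(0))
    return found


url = "http://www.example.com/sites/default/files/DellComputer.jpg"
print(urls_outside_img_tags(url))                        # plain text: kept
print(urls_outside_img_tags("<a href='" + url + "'>"))   # hyperlink: kept
print(urls_outside_img_tags("<img src='" + url + "'>"))  # img tag: skipped
```

Unlike the lookbehind prefix, this keeps a URL that follows an img tag on the same line, as long as it sits outside the tag itself.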
