
Regular expression to match img tag's url

This regular expression:

<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>

seems to process endlessly when given this text:

<img src=http://www.blahblahblah.com/houses/Images/
    single_and_multi/roof/feb09/01_img_trrnjks_vol2009.jpg' />

I would expect it to fail to find a match (quickly), because there is only one single quote in the text. I have had this happen in C# and also using the Expresso regex tool. If the text is a lot shorter it seems to work.

<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>

Taking out a couple of branches, along with the start and the end, leaves:

([^"'>]+)+

How many ways can this match "hello"?

(hell)(o)
(hel)(lo)
(hel)(l)(o)
(he)(llo)
(he)(l)(lo)
(he)(l)(l)(o)
... and so on
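
To put a number on "and so on", here is a small Python sketch that enumerates every way of cutting a string into consecutive non-empty pieces, which is what the nested `+` allows the engine to try. The helper name `splits` is made up for illustration; it only models the partitions and is not how any regex engine is actually implemented.

```python
def splits(s):
    """Yield every way to cut s into consecutive non-empty pieces."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        # Take the first i characters as one piece, then split the rest.
        for rest in splits(s[i:]):
            yield [s[:i]] + rest


ways = list(splits("hello"))
print(len(ways))  # 16, i.e. 2 ** (len("hello") - 1)
```

Each extra character doubles the count, so a URL of 80-odd characters leaves far more partitions than an engine can try before giving up on the match.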

Sounds like one of the situations where the regex engine is backtracking a lot. Mastering Regular Expressions by Friedl has some good material on the topic.
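
One rough way to watch the backtracking happen is to time the original pattern against short, deliberately unmatched inputs. This is a sketch in Python, whose `re` module is also a backtracking engine; the function name `time_failed_match`, the test string shape (a run of `a` followed by a lone single quote) and the lengths are my own choices, and the timings will differ between machines and engines.

```python
import re
import time

# The pattern from the question, matched case-insensitively.
PATTERN = re.compile(r"""<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>""", re.IGNORECASE)


def time_failed_match(n):
    """Search a tag body of n characters that ends in a lone single quote."""
    text = "<img " + "a" * n + "'"
    start = time.perf_counter()
    result = PATTERN.search(text)
    return result, time.perf_counter() - start


# Keep n small: the time is expected to grow quickly with each step.
for n in (8, 10, 12, 14, 16):
    result, elapsed = time_failed_match(n)
    print(n, result, f"{elapsed:.4f}s")
```

If the time roughly doubles for every extra character, that would be consistent with the engine trying every partition before reporting no match.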

Other commenters have mentioned the complexity being the likely cause of the performance problem. I'd add that if you're trying to match something resembling an IMG tag, I think you want a regex more like this:

<IMG(\s+[a-z]+=('[^']*'|"[^"]*"|[^\s'">]+))+>

Of course, there are still valid HTML variations that this regex won't catch, like a closing / (required in XHTML) or whitespace before the closing bracket. And it will pass some invalid cases, like unsupported attribute names.
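
A quick check of that pattern in Python, as a sketch only: the `good` tag is an invented example, `bad` is the text from the question, and I am assuming case-insensitive matching since the pattern is written with an upper-case `IMG`.

```python
import re

# Each repetition must start with whitespace, a name and "=", so the
# engine has only one way to divide the text between repetitions.
IMG_TAG = re.compile(
    r"""<IMG(\s+[a-z]+=('[^']*'|"[^"]*"|[^\s'">]+))+>""",
    re.IGNORECASE,
)

good = """<img src="http://example.com/roof.jpg" alt='roof'>"""
bad = ("<img src=http://www.blahblahblah.com/houses/Images/\n"
       "    single_and_multi/roof/feb09/01_img_trrnjks_vol2009.jpg' />")

print(bool(IMG_TAG.search(good)))  # True
print(IMG_TAG.search(bad))         # None
```

Note that, as said above, a tag ending in ` />` or with a space before `>` is rejected by this pattern as written.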

I think this is what you were trying for. As mentioned elsewhere, I think the cause of your long run time is extreme repetition: the greedy grab for anything that is not a quote or > is or-ed with the quoted-string branches (which also match greedily), all inside a repeated group.

This seems to run swiftly with either correctly formatted or incorrectly formatted tags.

<img(\s+((\w+)=(('[^']*?')|("[^"]*?"))))+? />
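
Since the title asks for the img tag's url, here is a minimal Python sketch that uses the pattern above to find a tag and then pulls out `src`. The second pattern `ATTR` and the function `img_src` are my own additions for illustration, not part of the answer; a repeated group only keeps its last capture in Python, which is why the attributes are re-scanned separately.

```python
import re

# The pattern from above: quoted attributes only, tag must end in " />".
IMG_TAG = re.compile(r"""<img(\s+((\w+)=(('[^']*?')|("[^"]*?"))))+? />""")
# Hypothetical helper pattern: one name=quoted-value pair.
ATTR = re.compile(r"""(\w+)=('[^']*'|"[^"]*")""")


def img_src(html):
    """Return the src of the first matching img tag, or None."""
    tag = IMG_TAG.search(html)
    if tag is None:
        return None
    for name, value in ATTR.findall(tag.group(0)):
        if name.lower() == "src":
            return value[1:-1]  # drop the surrounding quotes
    return None


print(img_src("""<p><img src="http://example.com/a.jpg" alt='roof' /></p>"""))
# http://example.com/a.jpg
```

The unquoted `src=http://...` text from the question returns `None` here, because this pattern insists on quoted values.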

Could you post exactly what you are trying to find or extract? Do you want to figure out what the img tag points to? That would greatly increase the chances of being able to provide a better answer.
