简体   繁体   中英

Regular expression to match img tag's url

This regular expression:

<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>

seems to process endlessly when given this text

<img src=http://www.blahblahblah.com/houses/Images/
    single_and_multi/roof/feb09/01_img_trrnjks_vol2009.jpg' />

I would expect it to - not find a match (quickly) - because there is only one single quote in the text. I have had this happen in C# and also using the Expresso regex tool. If the text is a lot shorter it seems to work.

<IMG\s([^"'>]+|'[^']*'|"[^"]*")+>

Taking out a couple of branches, the start and the end:

([^"'>]+)+

How many ways can this match "hello"?

(hell)(o)
(hel)(lo)
(hel)(l)(o)
(he)(llo)
(he)(l)(lo)
(he)(l)(l)(o)
... and so on

Sounds like one of the situations where the regex engine is backtracking a lot. Mastering Regular Expressions by Friedl has some good material on the topic.

Other commenters have mentioned the complexity being the likely cause for the perfo problem. I'd add that if you're trying to match something resembling an IMG tag, I think you want a regex more like this:

<IMG(\s+[a-z]+=('[^']*'|"[^"]*"|[^\s'">]+))+>

Of course, there are still valid HTML variations that this regex won't catch. Like a closing / (required in xhtml), or whitespace before the closing bracket. And it will pass some invalid cases, like unsupported attribute names.

I think this is what you were trying for, I think the cause of your long running is as mentioned elsewhere, due to extreme repetition caused by the greedy grab for non-quote or > being or-ed with the string processors (also using greedy ["'>] matching.

This seems to run swiftly with either correctly formatted or incorrectly formatted tags.

<img(\s+((\w+)=(('[^']*?')|("[^"]*?"))))+? />

Could you post what you are exactly trying to find or extract? Do you want to figure out what the img tag points to? That would greatly increase the chances of being able to provide a better answer.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM