
attributes pattern matcher takes a long time

I have a regex to get the src and the remaining attributes for all the images present in the content.

<img *((.|\s)*?) *src *= *['"]([^'"]*)['"] *((.|\s)*?) */*>

If the content I am matching against is like

<img src=src1"/> <img src=src2"/>

then find(index) hangs, and I see the following in the thread dump:

at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345) 

Is there a solution or a workaround for solving this issue?

A workaround is to use an HTML parser such as jsoup, for example:

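import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parse the markup, then select every <img> element that has a src attribute
Document doc =
      Jsoup.parse("<html><img src=\"src1\"/> <img src=\"src2\"/></html>");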
Elements elements = doc.select("img[src]");
for (Element element: elements) {
    System.out.println(element.attr("src"));
    System.out.println(element.attr("alt"));
    System.out.println(element.attr("height"));
    System.out.println(element.attr("width"));
}

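It looks like what you've got is an "evil regex", which is not uncommon when you try to construct one complicated regex to match one thing (src) within another thing (img). In particular, evil regexes usually happen when you apply repetition to a complex subexpression, which you are doing with (.|\s)*?. Whitespace matches both . and \s, so the engine has more than one way to consume the same text and, when the overall match fails (as it does for your input with the missing opening quotes), it backtracks through all of the combinations.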
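To make that concrete, here is a minimal Java sketch of a single-pass alternative. It is only an illustration under stated assumptions: the class name ImgSrcRegex and the method findSrcs are made up for this example, and it assumes attribute values never contain a > character. The ambiguous (.|\s)*? is swapped for the negated class [^>]*?, which has only one way to consume each character and cannot run past the end of the tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcRegex {

    // Hypothetical rewrite of the question's pattern: [^>]*? replaces (.|\s)*?,
    // and the backreference \2 requires the closing quote to match the opening one.
    // Group 1 = attributes before src, group 3 = src value, group 4 = attributes after src.
    private static final Pattern IMG = Pattern.compile(
            "<img\\b([^>]*?)\\bsrc\\s*=\\s*(['\"])([^'\"]*)\\2([^>]*?)/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns the src value of every <img> tag the pattern matches.
    public static List<String> findSrcs(String html) {
        List<String> srcs = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            srcs.add(m.group(3));
        }
        return srcs;
    }

    public static void main(String[] args) {
        // Well-formed tags: prints [src1, src2]
        System.out.println(findSrcs("<img alt='a' src=\"src1\"/> <img src='src2' width=\"5\">"));
        // The malformed input from the question: fails fast and prints []
        System.out.println(findSrcs("<img src=src1\"/> <img src=src2\"/>"));
    }
}
```

This keeps everything in one expression, but it is still a regex over HTML, so it will miss cases a real parser handles (for example a > inside a quoted value).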

A better approach would be to use two regexes; one to match all <img> tags, and then another to match the src attribute within it.

My Java's rusty, so I'll just give you the pseudocode solution:

foreach( imgTag in input.match( /<img .*?>/ig ) ) {
    src = imgTag.match( /\bsrc *= *(['\"])(.*?)\1/i );
    // if you want to get other attributes, you can do that the same way:
    alt = imgTag.match( /\balt *= *(['\"])(.*?)\1/i );
    // even better, you can get all the attributes in one go:
    attrs = imgTag.match( /\b(\w+) *= *(['\"])(.*?)\2/g );
    // each match captures the attr name in group 1 (alt, height, width,
    // src, etc.) and the attr value in group 3; group 2 is just the
    // quote character used by the backreference
}

Note the use of a backreference to match the appropriate type of closing quote (i.e., this will match both src='abc' and src="abc"). Also note that the quantifiers are lazy here (*? instead of just *); this is necessary to prevent too much from being consumed.
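A small Java sketch of those two points, for anyone who wants to see the difference. The class name LazyVsGreedy and the helper matches are invented for this illustration, and the input string is just sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {

    // Collects every match of the given regex in the input.
    public static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "<img src='a.png'> <img src=\"b.png\">";

        // Greedy: .* runs to the last '>', so both tags come back as one match.
        System.out.println(matches("<img .*>", input).size());   // prints 1

        // Lazy: .*? stops at the first '>', so each tag is its own match.
        System.out.println(matches("<img .*?>", input).size());  // prints 2

        // The backreference \1 demands the same quote character that opened the value.
        // prints [src='a.png', src="b.png"]
        System.out.println(matches("\\bsrc *= *(['\"])(.*?)\\1", input));
    }
}
```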

EDIT: even though my Java's rusty, I was able to crank out an example. Here's the solution in Java:

import java.util.regex.*;

public class Regex {

    public static void main( String[] args ) {
        String input = "<img alt=\"altText\" src=\"src\" height=\"50\" width=\"50\"/> <img alt='another image' src=\"foo.jpg\" />";
        Pattern attrPat = Pattern.compile( "\\b(\\w+) *= *(['\"])(.*?)\\2" );
        Matcher imgMatcher = Pattern.compile( "<img .*?>" ).matcher( input );
        while( imgMatcher.find() ) {
            String imgTag = imgMatcher.group();
            System.out.println( imgTag );
            Matcher attrMatcher = attrPat.matcher( imgTag );
            while( attrMatcher.find() ) {
                // group 1 is the attribute name, group 3 the value (group 2 is the quote)
                String attr = attrMatcher.group(1);
                String value = attrMatcher.group(3);
                System.out.format( "\tattr: %s, value: %s%n", attr, value );
            }
        }
    }
}
