
Java Regex - How to replace a pattern or how to

I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags. The IMG tags look typically like this:

<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

where the attributes are NOT in any specific order. I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so they look like this:

<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

I have the following class so far:

import java.util.regex.*;


public class Replacer {

    // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
    private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN,  Pattern.CASE_INSENSITIVE);


    public static void findMatches(String html){
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        // Check all occurrences
        System.out.println("------------------------");
        System.out.println("Following Matches found:");
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        System.out.println("------------------------");
    }

    public static String replaceMatches(String html){
        //Pattern replace = Pattern.compile("\\s+");
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        html = matcher.replaceAll(REPLACEMENT);
        return html;
    }
}

So, my method findMatches(String html) seems to correctly find all IMG tags whose src attribute starts with ./.

However, my method replaceMatches(String html) does not replace the matches correctly. I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect, or my usage of the replaceAll method is, or both. As you can see, the replacement String contains 2 parts which are identical in all IMG tags: <img and src="./. In between these 2 parts, there should be the 0 or more HTML attributes from the original string. How do I formulate such a REPLACEMENT string? Can somebody please enlighten me?

Don't use regex for HTML. Use a parser, obtain the src attribute, and replace it.

Try these:

PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"

Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./.

Notice how I changed your .* to [^>]*, too. If there happened to be two IMG tags on the same line, like this:
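To see the capture-and-reinsert idea end to end, here is a minimal sketch. The class name SrcPrefixStripper and the method name strip are made up for illustration; only the pattern and the $1 replacement come from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question's code.
class SrcPrefixStripper {
    // Group 1 captures "<img", any attributes before src, and src=" itself.
    // The trailing \./ is matched but left outside the group.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        // "$1" puts group 1 back, so only the "./" is dropped.
        return IMG_SRC.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" class=\"thumbimage\" />";
        System.out.println(strip(html));
        // prints: <img alt="" src="Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" />
    }
}
```

Tags whose src does not start with ./ are left untouched, because the pattern simply does not match them.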

<img src="good" /><img src="./bad" />

...your regex would match this:

<img src="good" /><img src="./

It would do that even if you used a non-greedy .*? instead. Using [^>]* makes sure the match is always contained within the one tag.
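If you want to check this yourself, the following sketch runs the three variants against the two-tag line. TagBoundaryDemo and firstMatch are hypothetical names used only for this demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing .*, .*? and [^>]* on the same input.
class TagBoundaryDemo {
    // Returns the text of the first match, or null if the regex does not match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy .* runs across the first tag into the second one.
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", html));
        // prints: <img src="good" /><img src="./

        // Non-greedy .*? still starts at the first <img and expands until src="./ is found.
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", html));
        // prints: <img src="good" /><img src="./

        // [^>]* cannot cross the closing > of the first tag, so only the second tag matches.
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", html));
        // prints: <img src="./
    }
}
```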

Your replacement is incorrect. replaceAll replaces each matched string with the replacement, and the replacement is not interpreted as a regexp. To achieve what you want, you need to use groups. A group is delimited by parentheses in the regexp, and each opening parenthesis starts a new group. In the replacement string, you can use $i to reproduce what a group has matched, where 'i' is the group number. See the documentation of appendReplacement for the details.

// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while(m.find()) {
    // Found a match!
    // Append all chars before the match, then replace the match with the
    // replacement (which refers to groups 1 and 2 via $1 and $2; these
    // match, respectively, everything between '<img' and 'src', and
    // everything between the src value and the closing '>')
    m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb);// No more match, we append the end of input
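
Adapted to the question's actual goal (dropping the leading ./), the same find / appendReplacement / appendTail loop might look like the sketch below. The class name AppendReplacementDemo, the method name strip, and the two-group pattern are assumptions made for illustration, not the asker's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class showing the appendReplacement loop on the "./" case.
class AppendReplacementDemo {
    // Group 1: "<img", any attributes before src, and src="
    // Group 2: the rest of the src value plus its closing quote
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img[^>]*\\ssrc=\")\\./([^\"]*\")", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Copies the text before the match, then groups 1 and 2 without the "./".
            m.appendReplacement(sb, "$1$2");
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>a</p><img src=\"./x.png\"><IMG class=\"c\" src=\"./z.png\">"));
        // prints: <p>a</p><img src="x.png"><IMG class="c" src="z.png">
    }
}
```

For a fixed replacement string like this one, matcher.replaceAll("$1$2") does the same job in one call; the explicit loop is mainly useful when the replacement has to be computed per match.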

Hope this helps you

If src attributes only occur in your HTML within img tags, you can just do this:

input.replace("src=\"./", "src=\"")
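
A small sketch of that literal replacement follows, with a made-up class name (LiteralReplaceDemo). It also shows the caveat mentioned above: any other tag with a src attribute gets rewritten too.

```java
// Hypothetical demo of the plain String.replace approach.
class LiteralReplaceDemo {
    static String strip(String html) {
        // String.replace treats both arguments as literal text, not as a regex,
        // so the dot needs no escaping. It replaces every occurrence.
        return html.replace("src=\"./", "src=\"");
    }

    public static void main(String[] args) {
        System.out.println(strip("<img src=\"./a.png\"><script src=\"./b.js\"></script>"));
        // prints: <img src="a.png"><script src="b.js"></script>
    }
}
```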

You could also do this without Java by using sed if you're on a *nix OS.
