
Why is this regex not giving the expected output?

I have a string containing the value given below. I want to replace the HTML img tags that contain a specific customerId with some new text. I tried a small Java program, but it is not giving me the expected output. Here are the details.

My input string is

 String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
    + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

Regex is

  String regex = "(?s)\\<img.*?customerId=3340.*?>";

The new text I want to put into the input string:

EDIT Starts:

String newText = "<img src=\"getCustomerNew.do\">";

EDIT ENDS:

Now I am doing

  String outputText = inputText.replaceAll(regex, newText);

The output is

 Starting here.. Replacing Text ..Ending here

But my expected output is

 Starting here.. <img src="getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here

Please note that in my expected output only the img tag containing customerId=3340 is replaced with Replacing Text. I don't understand why both img tags are replaced in the actual output.

You've got "any character" patterns ( .*? ) in there. They are lazy, but with (?s) they can still match every character, including > and line breaks. The match begins at the first <img , and since that tag does not contain customerId=3340 , the first .*? keeps expanding past the end of the tag until it reaches the id in the second tag. The final > then matches the end of the second tag, so both tags fall inside a single match.

You should be able to fix this by changing the .*? parts to something like [^>]+ so that the match can't span past the first > character.
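
A minimal sketch of that change, run against the input from the question. The class name NegatedClassFix and the helper replaceTag are made up for illustration; only the regex itself comes from the suggestion above.

```java
public class NegatedClassFix {
    // Input string from the question: two img tags, only the second has customerId=3340.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
        + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // [^>]+ matches one or more characters other than '>', so a match
    // that starts at "<img" can never run past the end of that tag.
    static final String REGEX = "<img[^>]+customerId=3340[^>]+>";

    // Hypothetical helper: replaces every tag matched by REGEX.
    static String replaceTag(String text, String replacement) {
        return text.replaceAll(REGEX, replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceTag(INPUT, "Replacing Text"));
    }
}
```

With this pattern the attempt that starts at the first img tag fails, because that tag does not contain customerId=3340, and only the second tag is replaced.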

Parsing HTML with regular expressions is bound to cause pain.

As other people have told you in the comments, HTML is not a regular language, so using regex to manipulate it is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but after a little googling it seems you need something like:

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class MyJsoupExample {
    public static void main(String[] args) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
            + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        Document doc = Jsoup.parse(inputText);
        // Select img elements whose src attribute contains "customerId=3340".
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // Replace the whole img element with a text node.
            // Older jsoup releases require the two-argument form: new TextNode(text, baseUri).
            element.replaceWith(new TextNode("my replaced text"));
        }
        System.out.println(doc.toString());
    }
}

Basically, the code gets the list of img nodes with a src attribute containing a given string:

Elements myImgs = doc.select("img[src*=customerId=3340]");

It then loops over the list and replaces those nodes with some text.

UPDATE

If you don't want to replace the whole img node with text, but instead need to give its src attribute a new value, then you can replace the body of the for loop with:

element.attr("src", "my new value");

or if you want to change just a part of the src value then you can do:

String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));

which is very similar to what I posted in this thread.

What happens is that your regex starts matching at the first img tag, then consumes everything (regardless of whether the quantifier is greedy or lazy) until it finds customerId=3340, and then continues consuming until it finds > .
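
To see this for yourself, you can print what the regex actually matches. This is only a small sketch; the class name LazyMatchDemo and the helper firstMatch are invented for the example, while the regex and input are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyMatchDemo {
    // Input string from the question.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
        + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // Regex from the question, unchanged.
    static final String REGEX = "(?s)\\<img.*?customerId=3340.*?>";

    // Hypothetical helper: returns the first substring matched by the regex, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The printed match starts at the first <img and ends at the '>' of the second tag.
        System.out.println(firstMatch(REGEX, INPUT));
    }
}
```

A lazy quantifier only makes the match end as early as possible from a given starting point; it does not move the starting point forward to the second tag.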

If you want it to consume just the img with customerId=3340, think about what distinguishes this tag from the other tags it may match.

In this particular case, one possible solution is to look at what comes before that img tag using a look-behind (which checks the preceding text without consuming it). For this input, the following regex works:

String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
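
Here is a rough sketch of that regex in use, together with one case where it stops working. The class name LookBehindDemo, the helper replaceTag and the second input (VARIATION) are made up for illustration; VARIATION is a hypothetical string in which the first img tag also follows a </p>.

```java
public class LookBehindDemo {
    // Input string from the question.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
        + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // Hypothetical input: both img tags are preceded by </p>.
    static final String VARIATION =
        "</p><img src=\"a.do?customerId=1\"/></p><img src=\"a.do?customerId=3340\"/>";

    // The match may only start at an <img that directly follows </p>.
    static final String REGEX = "(?<=</p>)<img src=\".*?customerId=3340.*?>";

    static String replaceTag(String text) {
        return text.replaceAll(REGEX, "Replacing Text");
    }

    public static void main(String[] args) {
        // Only the second tag is replaced: the first <img is not preceded by </p>.
        System.out.println(replaceTag(INPUT));
        // Both tags are swallowed again: the first <img now satisfies the look-behind.
        System.out.println(replaceTag(VARIATION));
    }
}
```

So the look-behind depends on the surrounding markup staying exactly as it is in the question; if that cannot be guaranteed, a pattern that cannot cross > or an HTML parser is the safer choice.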
