I have a pattern like this: <[a-zA-Z][^>]*(?:poster|src)=(['"])([^'"]+)\1[^>]*>
With it I want to replace the value of the src or poster attribute.
It works for
<video src='srcVal' />
and
<video poster='posterVal' />
but for
<video poster='posterVal' src='srcVal' />
it only changes the src value, because matcher.group(2) returns only srcVal.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) throws Exception {
        String html = "<video poster='posterVal' src='srcVal' />";
        // Group 1 is the quote character, group 2 the attribute value.
        Pattern resourcePattern = Pattern.compile("<[a-zA-Z][^>]*(?:poster|src)=(['\"])([^'\"]+)\\1[^>]*>");
        Matcher matcher = resourcePattern.matcher(html);
        int last = 0;
        StringBuilder sb = new StringBuilder();
        while (matcher.find()) {
            String path = matcher.group(2) + "Changed";
            // Copy everything up to the captured value, then the modified value.
            sb.append(html.substring(last, matcher.start(2)) + path);
            last = matcher.end(2);
        }
        sb.append(html.substring(last));
        System.out.println(sb);
        // outputs   <video poster='posterVal' src='srcValChanged' />
        // expecting <video poster='posterValChanged' src='srcValChanged' />
    }
}
Does anybody have an idea how to do this?
The basic problem is the [^>]* near the start of your expression. Because * is greedy, it eats up as many characters as it can while still allowing the rest of the expression to match, so given
<video poster='posterVal' src='srcVal' />
the [^>]* will gobble up ideo poster='posterVal' (up to and including the space before src=), leaving only the src attribute for the rest of the pattern to match.
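To see the greedy behaviour directly, here is a small sketch that wraps the leading [^>]* of the question's pattern in an extra capturing group. The class name GreedyDemo and the helper inspect are made up for illustration, and the extra group shifts the numbering (1 = consumed prefix, 2 = quote, 3 = value).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // The question's pattern, with the leading [^>]* captured as group 1
    // so we can print what it consumed. Group 2 = quote, group 3 = value.
    static final Pattern TAG = Pattern.compile(
            "<[a-zA-Z]([^>]*)(?:poster|src)=(['\"])([^'\"]+)\\2[^>]*>");

    // Returns {consumed prefix, captured value} for the first match, or null.
    static String[] inspect(String html) {
        Matcher m = TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3) };
    }

    public static void main(String[] args) {
        String[] r = inspect("<video poster='posterVal' src='srcVal' />");
        System.out.println("prefix = [" + r[0] + "]"); // prefix = [ideo poster='posterVal' ]
        System.out.println("value  = [" + r[1] + "]"); // value  = [srcVal]
    }
}
```

The whole tag is consumed by a single match, so the loop in the question never gets a second chance to see the poster attribute.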
I would approach it differently: rather than trying to write a regex that matches the whole tag, just write one that matches the attributes you're interested in, and replace all matches of that expression:
html.replaceAll("\\b((?:poster|src)=)(['\"])([^'\"]+)\\2", "$1$2$3Changed$2")
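If the new value has to be computed rather than just appended, the same attribute-only idea can be written with an explicit find loop. This is a minimal sketch, not a definitive implementation: AttributeRewriter and transform are hypothetical names, and transform stands in for whatever rewriting is actually needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeRewriter {
    // Group 1 = attribute name plus '=', group 2 = opening quote, group 3 = value.
    // The backreference \2 requires the closing quote to match the opening one.
    private static final Pattern ATTR = Pattern.compile(
            "\\b((?:poster|src)=)(['\"])([^'\"]+)\\2");

    // Hypothetical placeholder for the real value transformation.
    static String transform(String value) {
        return value + "Changed";
    }

    static String rewrite(String html) {
        Matcher m = ATTR.matcher(html);
        // StringBuffer is used because appendReplacement accepts it on every Java version.
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + m.group(2) + transform(m.group(3)) + m.group(2);
            // quoteReplacement keeps '$' and '\' in the new value from being
            // interpreted as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<video poster='posterVal' src='srcVal' />"));
        // prints <video poster='posterValChanged' src='srcValChanged' />
    }
}
```

Note that \b also sees a word boundary after a hyphen, so an attribute such as data-src would be rewritten too, and the pattern does not check that the match sits inside a tag.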
But as other posters have commented, it would be much more sensible to use a proper parser that understands the language rather than trying to manipulate the textual representation with regular expressions.
I wouldn't do this with regex, but you can try something like this:
<[a-zA-Z]*[^>]*(?:(poster)|src)=(['\"])([^'\"]+)\\2(?(1)[^>]*(?:src=(['\"])([^'\"]+)\\4)?[^>]*|[^>]*(?:poster=(['\"])([^'\"]+)\\6)?[^>]*)>
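I don't have time to test it right now, sorry. Note that it relies on a conditional group, (?(1)...|...), which java.util.regex does not support, so it needs a flavor such as PCRE.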
Edit:
Less performance-oriented:
<[a-zA-Z]*(?=(?:[^>]*?poster=['\"]([^'\"]+))?)(?=(?:[^>]*?src=['\"]([^'\"]+))?)[^>]*(?:poster|src)[^>]*>
If you only want to match video tags, change it to the following, which greatly improves it:
<video(?=(?:[^>]*?poster=['\"]([^'\"]+))?)(?=(?:[^>]*?src=['\"]([^'\"]+))?)[^>]*(?:poster|src)[^>]*>
Explanation (as I guess it must look quite confusing):
We're using two lookaheads to capture the interesting parts. Lookaheads let us scan what comes ahead twice from the same position, so the order of the attributes doesn't matter. However, those lookaheads must always succeed (hence the * and ? quantifiers), and they have to be lazy and greedy at the same time: lazy enough to stop at the first poster/src, yet greedy enough to actually reach it. .*?a? will always catch nothing, so here we use (?:.*?a)? instead. It lazily tries to reach the a, and if that fails it simply matches nothing, which is not a problem.
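The difference between the two forms is easy to check in isolation. The sketch below (class and method names are made up for the demonstration) runs both against the input "xxa" and prints what the first match covers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyOptionalDemo {
    // Returns the text covered by the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The lazy .*? takes nothing and the optional a gives up immediately.
        System.out.println("[" + firstMatch(".*?a?", "xxa") + "]");     // []
        // The optional group is tried as a whole, so .*? expands until a is found.
        System.out.println("[" + firstMatch("(?:.*?a)?", "xxa") + "]"); // [xxa]
        // With no a at all, the group is skipped and the match is empty.
        System.out.println("[" + firstMatch("(?:.*?a)?", "xxx") + "]"); // []
    }
}
```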
The last part of the regex makes sure we only match tags that have a poster or a src attribute: because the lookaheads are optional, they always succeed, so they only capture and can't filter anything.
Note that I removed the check for your attributes, as it was useless anyway.
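Here is a rough sketch of how the video-only lookahead pattern could be used from Java to rewrite both values in one pass. LookaheadRewriter and its rewrite method are illustrative names, not part of the original answer. The sketch relies on java.util.regex keeping groups captured inside a successful lookahead, and on Matcher.start(group) returning -1 for a group that did not participate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadRewriter {
    // Group 1 = poster value, group 2 = src value; either may be unset.
    private static final Pattern VIDEO = Pattern.compile(
            "<video(?=(?:[^>]*?poster=['\"]([^'\"]+))?)"
          + "(?=(?:[^>]*?src=['\"]([^'\"]+))?)"
          + "[^>]*(?:poster|src)[^>]*>");

    static String rewrite(String html) {
        Matcher m = VIDEO.matcher(html);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Collect the [start, end) span of every value that was captured.
            List<int[]> spans = new ArrayList<>();
            for (int g = 1; g <= 2; g++) {
                if (m.start(g) >= 0) {
                    spans.add(new int[] { m.start(g), m.end(g) });
                }
            }
            // The attributes may appear in either order, so process them by position.
            spans.sort(Comparator.comparingInt(s -> s[0]));
            for (int[] s : spans) {
                // Copy the text up to the end of the value, then append the suffix.
                sb.append(html, last, s[1]).append("Changed");
                last = s[1];
            }
        }
        sb.append(html, last, html.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<video poster='posterVal' src='srcVal' />"));
        // prints <video poster='posterValChanged' src='srcValChanged' />
        System.out.println(rewrite("<video src='srcVal' poster='posterVal' />"));
        // prints <video src='srcValChanged' poster='posterValChanged' />
    }
}
```

Compared with replacing each attribute on its own, this needs noticeably more bookkeeping, which is one more argument for the attribute-only pattern or a real HTML parser.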