Regex: replacing two groups

Question

I have a pattern like: <[a-zA-Z][^>]*(?:poster|src)=(['"])([^'"]+)\1[^>]*> and I want to replace the value of the src or poster attributes.

It works for

<video src='srcVal' />

and

<video poster='posterVal' />

but for

<video poster='posterVal' src='srcVal' />

it only changes the src value, because matcher.group(2) returns only srcVal.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) throws Exception {
        String html = "<video poster='posterVal' src='srcVal' />";
        Pattern resourcePattern = Pattern.compile("<[a-zA-Z][^>]*(?:poster|src)=(['\"])([^'\"]+)\\1[^>]*>");
        Matcher matcher = resourcePattern.matcher(html);
        int last = 0;
        StringBuilder sb = new StringBuilder();
        // The pattern matches a whole tag, so this loop runs once per tag, not once per attribute.
        while (matcher.find()) {
            String path = matcher.group(2) + "Changed";
            // Copy everything up to the captured value, then the rewritten value.
            sb.append(html.substring(last, matcher.start(2)) + path);
            last = matcher.end(2);
        }
        sb.append(html.substring(last));
        System.out.println(sb);
        // outputs   <video poster='posterVal' src='srcValChanged' />
        // expecting <video poster='posterValChanged' src='srcValChanged' />
    }
}

Does anybody have an idea how to do this?

Answer 1

The basic problem is the [^>]* near the start of your expression. Because * is greedy, it eats up as many characters as it can while still allowing the rest of the expression to match, so given

<video poster='posterVal' src='srcVal' />

the [^>]* will gobble up ideo poster='posterVal' up to and including the space before src=, which leaves only the src attribute for the capturing groups.
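
To make the greedy behaviour visible, here is a small sketch that wraps the leading [^>]* in an extra capturing group. That extra group, the class name GreedyDemo and the helper inspect are additions for illustration only, so the quote backreference becomes \2 and the value moves to group 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // The question's pattern with one extra group (group 1) around the leading [^>]*.
    private static final Pattern TAG = Pattern.compile(
        "<[a-zA-Z]([^>]*)(?:poster|src)=(['\"])([^'\"]+)\\2[^>]*>");

    /** Returns { text swallowed by the leading [^>]*, captured attribute value }, or null if no tag matches. */
    static String[] inspect(String html) {
        Matcher m = TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = inspect("<video poster='posterVal' src='srcVal' />");
        System.out.println("swallowed: [" + parts[0] + "]"); // swallowed: [ideo poster='posterVal' ]
        System.out.println("value:     [" + parts[1] + "]"); // value:     [srcVal]
    }
}
```

The whole tag is consumed by a single match, so a find() loop never gets a second chance to see the poster attribute.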

I would approach it differently. Rather than trying to write a regex that matches the whole tag, just write one that matches the attributes you're interested in, and replace all matches of that expression:

html.replaceAll("\\b((?:poster|src)=)(['\"])([^'\"]+)\\2", "$1$2$3Changed$2")

But as other posters have commented, it would be much more sensible to use a proper parser that understands the language rather than trying to manipulate the textual representation with regular expressions.
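
If the new value has to be computed in code rather than spelled out in a replacement string, the same attribute-only idea can be sketched with Matcher.appendReplacement. The class name AttributeRewrite and the method appendChanged are made up for this example. The closing quote refers back to group 2, the opening quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeRewrite {
    // Group 1: attribute name plus '=', group 2: opening quote, group 3: attribute value.
    private static final Pattern ATTR =
        Pattern.compile("\\b((?:poster|src)=)(['\"])([^'\"]+)\\2");

    /** Appends "Changed" to every poster/src value found in the input. */
    static String appendChanged(String html) {
        Matcher m = ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String newValue = m.group(3) + "Changed";
            String rebuilt = m.group(1) + m.group(2) + newValue + m.group(2);
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(appendChanged("<video poster='posterVal' src='srcVal' />"));
        // <video poster='posterValChanged' src='srcValChanged' />
    }
}
```

Because \b only requires a word boundary, this sketch would also rewrite an attribute such as data-src, and it looks at any text, not just text inside tags. That is one more reason to prefer a real parser for anything beyond simple input.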

Answer 2

I wouldn't do this with regex, but you can try something like this:

<[a-zA-Z]*[^>]*(?:(poster)|src)=(['\"])([^'\"]+)\\2(?(1)[^>]*(?:src=(['\"])([^'\"]+)\\4)?[^>]*|[^>]*(?:poster=(['\"])([^'\"]+)\\6)?[^>]*)>

I don't have time to test it right now though, sorry.
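
One thing worth checking before relying on the pattern above: it uses a conditional, (?(1)...|...), and java.util.regex does not document such a construct. The sketch below simply asks Pattern.compile whether it accepts a pattern. The class name ConditionalCheck and the helper compiles are hypothetical names for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ConditionalCheck {
    // The pattern from this answer, written as a Java string literal.
    static final String WITH_CONDITIONAL =
        "<[a-zA-Z]*[^>]*(?:(poster)|src)=(['\"])([^'\"]+)\\2"
        + "(?(1)[^>]*(?:src=(['\"])([^'\"]+)\\4)?[^>]*"
        + "|[^>]*(?:poster=(['\"])([^'\"]+)\\6)?[^>]*)>";

    /** Returns true if java.util.regex accepts the given pattern. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles(WITH_CONDITIONAL)); // false
        System.out.println(compiles("(a)?(?(1)b|c)"));  // false: a minimal conditional
        System.out.println(compiles("(a)?b"));          // true
    }
}
```

Engines with conditional support (PCRE, for example) can take the pattern as-is. In Java the lookahead variants below are the ones to try.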

Edit: a less performance-oriented alternative:

<[a-zA-Z]*(?=(?:[^>]*?poster=['\"]([^'\"]+))?)(?=(?:[^>]*?src=['\"]([^'\"]+))?)[^>]*(?:poster|src)[^>]*>

If you only want to match video tags, change it to the following (which improves it greatly):

<video(?=(?:[^>]*?poster=['\"]([^'\"]+))?)(?=(?:[^>]*?src=['\"]([^'\"]+))?)[^>]*(?:poster|src)[^>]*>

Explanation (as I guess it must look quite disturbing):

We're using two lookaheads to capture the interesting parts. Lookaheads let us scan what comes ahead twice from the same position, so the order of the attributes doesn't matter. However, those lookaheads must always succeed (hence the * and ?), and they have to be lazy and greedy at the same time: they must stop as soon as they see poster/src, yet go far enough to reach them. A pattern like .*?a? will practically always capture nothing, because the lazy .*? is happy to match zero characters and the optional a? then matches only if an a happens to sit right at the current position. So we use (?:.*?a)? instead: it lazily searches for the a, and if that fails the optional group simply matches nothing, which is not a problem.
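
The difference between the two forms is easy to see on a toy input. This is only an illustration: the class LazyOptionalDemo, the helper firstMatch and the sample strings are invented for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyOptionalDemo {
    /** Returns the text of the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Lazy .*? takes nothing and a? finds no 'a' at position 0: the match is empty.
        System.out.println("[" + firstMatch(".*?a?", "xxa") + "]");     // []
        // The optional group is tried first, so .*? is forced to grow until the 'a' is reached.
        System.out.println("[" + firstMatch("(?:.*?a)?", "xxa") + "]"); // [xxa]
        // With no 'a' at all, the group fails quietly and the match is empty again.
        System.out.println("[" + firstMatch("(?:.*?a)?", "xx") + "]");  // []
    }
}
```
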
The last part of the regex makes sure we only match tags that actually have a poster or a src attribute: since the lookaheads are optional, they only do the capturing and can't be used to enforce that.

Note that I removed the check for your attributes, as it was useless anyway.
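
For completeness, here is a sketch of how the video-only lookahead pattern could be plugged into the question's rewriting loop. The class LookaheadRewrite and the method appendChanged are hypothetical. The sketch relies on java.util.regex keeping groups captured inside a lookahead, and on Matcher.start returning -1 for a group that did not take part in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadRewrite {
    // Group 1: poster value, group 2: src value; either may be absent.
    private static final Pattern VIDEO = Pattern.compile(
        "<video(?=(?:[^>]*?poster=['\"]([^'\"]+))?)"
        + "(?=(?:[^>]*?src=['\"]([^'\"]+))?)"
        + "[^>]*(?:poster|src)[^>]*>");

    /** Appends "Changed" to the poster and src values of every matching video tag. */
    static String appendChanged(String html) {
        Matcher m = VIDEO.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Visit the two groups in document order, whichever attribute comes first.
            int[] groups = (m.start(1) > m.start(2)) ? new int[] {2, 1} : new int[] {1, 2};
            for (int g : groups) {
                if (m.start(g) < 0) {
                    continue; // this attribute is not present in the tag
                }
                out.append(html, last, m.end(g)).append("Changed");
                last = m.end(g);
            }
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(appendChanged("<video poster='posterVal' src='srcVal' />"));
        // <video poster='posterValChanged' src='srcValChanged' />
        System.out.println(appendChanged("<video src='s' poster='p' />"));
        // <video src='sChanged' poster='pChanged' />
    }
}
```

Compared with matching the attributes on their own, this needs extra bookkeeping to handle the two groups in the right order, which is the price of insisting on a whole-tag match.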

Answer 1: 0, 2013-04-03 15:49:55

Answer 2: 0, accepted, 2013-04-03 15:53:18
