Why doesn't this regular expression give the expected output?

Question

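I have a string containing the values shown below. I want to replace the HTML img tag that contains a specific customerId with some new text. I tried a small Java program, but it does not give me the expected output. Here are the details.

My input string is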

 String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
    + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

The regular expression is

  String regex = "(?s)\\<img.*?customerId=3340.*?>";

The new text I want to put into the input string

EDIT start:

String newText = "<img src=\"getCustomerNew.do\">";

EDIT end:

Now I am doing

  String outputText = inputText.replaceAll(regex, newText);

The output is

 Starting here.. Replacing Text ..Ending here

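But my expected output is

 Starting here.. <img src="getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here

Note that in my expected output, only the img tag containing customerId=3340 is replaced with "Replacing Text". I don't understand why both img tags are replaced in the output I get.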

Answer 1

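Your pattern contains "wildcard"/"any" patterns (.*?), and . matches any character, including >. Because the search starts at the first <img, the first wildcard keeps expanding, across the end of the first tag, until it reaches customerId=3340. The last fixed text in the pattern is the > character, so the match then ends at the next >, which in your input text is the last one.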

You should be able to fix this by changing the .* parts to something like [^>]+ so that the match cannot extend past the first > character.
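
The suggestion above can be sketched as follows. This is a minimal example, not the answerer's exact code: the class name NegatedClassExample and the helper replaceTag are made up for illustration, and [^>]* is used instead of [^>]+ so that the pattern also matches when nothing follows the customerId value inside the tag.

```java
import java.util.regex.Matcher;

public class NegatedClassExample {
    // [^>]* matches any character except '>', so a match that starts
    // at one <img cannot run past the end of that same tag.
    static final String REGEX = "<img[^>]*customerId=3340[^>]*>";

    // Hypothetical helper: replaces every matching img tag with newText.
    static String replaceTag(String inputText, String newText) {
        // quoteReplacement keeps '$' and '\' in newText from being
        // interpreted as group references by replaceAll.
        return inputText.replaceAll(REGEX, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replaceTag(inputText, "Replacing Text"));
    }
}
```

With the input string from the question, only the second img tag is replaced and the first one is left untouched. Note that customerId=3340 would also match a longer id such as 33401; appending (?!\\d) after 3340 is one way to rule that out.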

Parsing HTML with regular expressions is bound to cause trouble.

Answer 2

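As others have told you in the comments, HTML is not a regular language, so manipulating it with regular expressions is usually painful. Your best option is to use an HTML parser. I had not used Jsoup before, but after a bit of googling it seems you need something like this: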

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class MyJsoupExample {
    public static void main(String args[]) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
            + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
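        Document doc = Jsoup.parse(inputText);
        // Attribute selector: img elements whose src contains "customerId=3340"
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // Swap the whole img element for a plain text node
            element.replaceWith(new TextNode("my replaced text", ""));
        }
        System.out.println(doc.toString());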
    }
}

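Basically, the code gets the list of img nodes whose src attribute contains the given string.

Elements myImgs = doc.select("img[src*=customerId=3340]");

It then iterates over the list and replaces those nodes with some text.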

UPDATE

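If you don't want to replace the whole img node with text, but instead need to give its src attribute a new value, you can replace the body of the for loop with:

element.attr("src", "my new value");

Or, if you only want to change part of the src value, you can do this:

String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));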

This is very similar to what I posted in this thread.

Answer 3

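What happens is that your regex starts matching at the first img tag, then consumes everything (greedy or not) until it finds customerId=3340, and then keeps consuming everything until it finds a >.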
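
To see this, it can help to print what the original pattern actually matches. The sketch below uses the regex and input string from the question; the class name MatchSpanDemo and the helper firstMatch are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpanDemo {
    // Hypothetical helper: returns the text of the first match, or null if none.
    static String firstMatch(String regex, String inputText) {
        Matcher m = Pattern.compile(regex).matcher(inputText);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        String regex = "(?s)\\<img.*?customerId=3340.*?>";
        // The lazy quantifier only controls how far the match extends to the
        // right; it does not move the starting point, which is the first <img.
        System.out.println(firstMatch(regex, inputText));
    }
}
```

The printed match begins at the first img tag (the one with customerId=3334) and ends at the > that closes the second one, which is why both tags disappear from the output.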

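If you want only the img with customerId=3340 to be consumed, think about what makes this tag different from the other tags the pattern might match.

In this particular case, one possible solution is to use a lookbehind operator (which does not consume what it matches) to check what comes right before the img tag. This regex will work: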

String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
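
A small sketch of the lookbehind pattern applied to the input from the question is shown here; the class name LookbehindDemo and the helper replaceTag are made up for this example.

```java
import java.util.regex.Matcher;

public class LookbehindDemo {
    // (?<=</p>) requires "</p>" immediately before <img without making it
    // part of the match, so the preceding </p> survives the replacement.
    static final String REGEX = "(?<=</p>)<img src=\".*?customerId=3340.*?>";

    // Hypothetical helper: replaces every matching img tag with newText.
    static String replaceTag(String inputText, String newText) {
        return inputText.replaceAll(REGEX, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replaceTag(inputText, "Replacing Text"));
    }
}
```

This relies on the target tag being the only img that directly follows a </p> in this particular input. If the first img tag were also preceded by </p>, the match would again start there and span both tags.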

Answer 1
4 2012-12-13 18:18:15

Answer 2
1 accepted 2012-12-13 19:52:33

Answer 3
0 2012-12-15 15:47:35
