Why doesn't this regular expression give the expected output?

Question

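I have a string containing the values shown below. I want to replace the HTML img tag that contains a specific customerId with some new text. I tried a small Java program, but it does not give me the expected output. Here are the details.

My input string is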

 String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
    + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

The regular expression is

  String regex = "(?s)\\<img.*?customerId=3340.*?>";

The new text I want to put into the input string is

Edit start:

String newText = "<img src=\"getCustomerNew.do\">";

Edit end:

Now I do

  String outputText = inputText.replaceAll(regex, newText);

The output is

 Starting here.. Replacing Text ..Ending here

But my expected output is

 Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here

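Note that in my expected output, only the img tag containing customerId=3340 is replaced with "Replacing Text". I don't understand why both img tags are replaced in the output I get.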

Answer 1

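You have a "wildcard"/"any" pattern (.*) in there, which extends the match to the longest possible matching string, and the last fixed text in the pattern is a > character, so it matches the last > character in the input text, i.e. the very last one!

You should be able to fix this by changing the .* parts to something like [^>]+ so that the match cannot run past the first > character.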
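The negated character class suggested above can be sketched as follows. This is only an illustration, not necessarily the exact pattern the answerer had in mind: the class name `ImgTagReplace` and the method `replaceCustomerImg` are made up for the example, `[^>]*` is used instead of `[^>]+` so that the id may sit right next to the tag name or the closing `>`, and it assumes no attribute value contains a `>` character.

```java
import java.util.regex.Matcher;

public class ImgTagReplace {
    // [^>]* can never cross a '>' so the match stays inside one tag.
    static final String REGEX = "<img[^>]*customerId=3340[^>]*>";

    // Replaces every img tag whose text contains customerId=3340.
    public static String replaceCustomerImg(String input, String newText) {
        // quoteReplacement keeps '$' and '\' in newText from being
        // interpreted as group references by replaceAll.
        return input.replaceAll(REGEX, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        String newText = "<img src=\"getCustomerNew.do\">";
        System.out.println(replaceCustomerImg(inputText, newText));
    }
}
```

With the input from the question, only the second img tag is replaced and the first one (customerId=3334) is left alone. Note that the pattern would also match an id such as 33401; appending `(?!\\d)` after `3340` is one way to rule that out.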

Parsing HTML with regular expressions is bound to cause trouble.

Answer 2

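As others have told you in the comments, HTML is not a regular language, so manipulating it with regular expressions is usually painful. Your best option is to use an HTML parser. I have not used Jsoup before, but after a bit of googling it seems you need something like this: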

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class MyJsoupExample {
    public static void main(String[] args) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
            + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        Document doc = Jsoup.parse(inputText);
        // Select img elements whose src attribute contains the given substring.
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // Newer jsoup releases deprecate this two-argument constructor
            // in favor of new TextNode(String).
            element.replaceWith(new TextNode("my replaced text", ""));
        }
        System.out.println(doc.toString());
    }
}

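Basically, the code gets the list of img nodes whose src attribute contains the given string.

Elements myImgs = doc.select("img[src*=customerId=3340]");

It then iterates over the list and replaces those nodes with some text.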

UPDATE

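If you do not want to replace the whole img node with text, but instead need to give its src attribute a new value, you can replace the body of the for loop with:

element.attr("src", "my new value");

Or, if you only want to change part of the src value, you can do this: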

String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));

This is very similar to what I posted in this thread.

Answer 3

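What happens is that your regex starts matching at the first img tag, then consumes everything (greedy or not) until it finds customerId=3340, and then keeps consuming everything until it finds a >.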
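To see that behavior directly, the following sketch prints what the question's regex actually matches. The class name `OverMatchDemo` and the helper `firstMatch` are hypothetical names introduced for this illustration; the regex and the input are copied from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverMatchDemo {
    // Returns the text of the first match, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        String regex = "(?s)\\<img.*?customerId=3340.*?>";
        // The match begins at the FIRST <img, because the engine tries the
        // leftmost starting position first and .*? is allowed to grow
        // across '>' until customerId=3340 is found.
        System.out.println(firstMatch(regex, inputText));
    }
}
```

The printed match spans from the first `<img` through the `>` that closes the second tag, which is why `replaceAll` removes both tags and everything between them. The reluctant quantifier only makes the match end as early as possible; it does not make it start later.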

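If you only want the img with customerId=3340, think about how this tag differs from the other tags that might match.

In this particular case, one possible solution is to use a lookbehind operator (which does not consume what it matches) to check what precedes the img tag. This regex will work: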

String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
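A minimal sketch of the lookbehind pattern above applied to the question's input. The class name `LookbehindDemo` and the method `replace` are hypothetical; the regex is the one from this answer.

```java
import java.util.regex.Matcher;

public class LookbehindDemo {
    // (?<=</p>) requires "</p>" immediately before <img without consuming it.
    static final String REGEX = "(?<=</p>)<img src=\".*?customerId=3340.*?>";

    public static String replace(String input, String newText) {
        return input.replaceAll(REGEX, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replace(inputText, "<img src=\"getCustomerNew.do\">"));
    }
}
```

This works here only because the target tag is the one directly preceded by `</p>`; the first img tag follows plain text, so the lookbehind rejects it as a starting point. If the surrounding markup changes, the pattern has to change with it.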


Answer 1
4 2012-12-13 18:18:15

Answer 2
1 accepted 2012-12-13 19:52:33

Answer 3
0 2012-12-15 15:47:35
