Removing MS Word links with a regular expression

Question

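I am parsing MS Word documents and extracting the text with Apache POI.

For a paragraph that looks like this:

The most popular fruits were apples and bananas (see section 'Common fruits' and subsection 'Detailed botanic descriptions' below).

I get a string that looks like this: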

The most popular fruits were apples and bananas (see section '\ HYPERLINK \\\\l "_Common_fruit_types\\" \\Common fruits\' and subsection '\ HYPERLINK \\\\l \\"_Botanic_description\\" \\Detailed botanic descriptions\' below).

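There are also other kinds of tags or keywords, such as 'PAGEREF' instead of 'HYPERLINK', but they always seem to follow the pattern \ TAGWORD {String1} \\{String2}\

So what I want to do is remove everything except {String2}. This is what I have tried so far:

1. RegEx pattern \(.*?)\ Result: {String2}\ (taken from an SO page that I can no longer find)
2. RegEx pattern \\\\[A-Za-z0-9]+ to remove the final \: nothing happened. What I meant to express was: remove a word (made up of letters and digits) including its backslash. I also tried \\\\\\\\[A-Za-z0-9]+ , with the same result.
3. RegEx pattern \(.*?)u0015 removed the whole link structure.
4. Since \(.*?)\(.*?)\ did the same (removed everything), I tried \(.*?)\[^(.*?)]\ , but that did nothing at all.

Alternative: while loop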
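The separators in such a string are easy to misread when printed, because they may be non-printable control characters rather than literal backslashes. The following is only a debugging sketch, under the assumption that the extracted text contains the control characters U+0013, U+0014 and U+0015 (the ones the while loop further down searches for). The class and method names are made up for illustration.

```java
public class ShowControlChars {
    // Replaces every ISO control character with a visible escape sequence
    // (backslash, the letter u, four hex digits) so the markers can be seen.
    static String reveal(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isISOControl(c)) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample resembling the extracted text.
        String sample = "see '\u0013 HYPERLINK \\l \"_Common_fruit_types\" "
                + "\u0014Common fruits\u0015' below";
        System.out.println(reveal(sample));
    }
}
```

Printing the revealed form before writing a pattern shows which characters the regex actually has to match.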

boolean textWasChanged = true;
while (textWasChanged) {
    int idx1 = text.indexOf("\u0013");
    int idx2 = text.indexOf("\u0014", idx1);
    if (idx1 > -1 && idx2 > -1 && text.replace(text.substring(idx1, idx2+1), "").length() < text.length()) {
        textWasChanged = true;
        text = text.replace(text.substring(idx1, idx2+1), "");
    } else {
        textWasChanged = false;
    }

}
text = text.replaceAll("\u0015", "");

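The manual cleanup works, but I am wondering whether it can be reduced to a one-liner or something similar.

Or more specifically:

How do I write a regex pattern that keeps only {String2}? From the regex manuals this seems to be possible; I just can't wrap my head around it.
Where is my mistake in step 2 and/or 4? I simply negated the (.*?) part, because that is the part I want to keep. But I obviously don't understand regex.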
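As a rough illustration of what may be going wrong in steps 2 and 4: inside square brackets the characters ( . * ? ) are literals, so [^(.*?)] matches exactly one character that is none of those five; it does not negate a group. And a pattern that looks for a literal backslash finds nothing if the separator is really a control character (an assumption based on the while loop above). The class name, method names and the <drop> sample below are hypothetical.

```java
public class CharClassVsGroup {
    // True only if s is exactly one character other than ( . * ? )
    static boolean matchesNegatedClass(String s) {
        return s.matches("[^(.*?)]");
    }

    // Keeping part of a match is done with a capturing group and a
    // back-reference ($1) in the replacement, not with negation.
    static String keepInner(String s) {
        return s.replaceAll("<drop>(.*?)</drop>", "$1");
    }

    // The step 2 pattern: a literal backslash followed by letters or digits.
    static String removeBackslashWord(String s) {
        return s.replaceAll("\\\\[A-Za-z0-9]+", "");
    }

    public static void main(String[] args) {
        System.out.println(matchesNegatedClass("x"));   // true
        System.out.println(matchesNegatedClass("xy"));  // false
        System.out.println(matchesNegatedClass("*"));   // false
        System.out.println(keepInner("<drop>keep</drop>"));  // keep
        // A control character is not a backslash, so nothing is removed.
        System.out.println(removeBackslashWord("a\u0015").length());  // 2
    }
}
```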

Answer 1

You can use the following Pattern to replace your entities:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String raw = "The most popular fruits were apples and bananas "
        + "(see section ‘\\u0013 HYPERLINK \\l \"_Common_fruit_types\\\" "
        + "\\u0001\\u0014Common fruits\\u0015’ and subsection ‘\\u0013 HYPERLINK \\l"
        + "\\\"_Botanic_description\\\" "
        + "\\u0001\\u0014Detailed botanic descriptions\\u0015’ below).";

// test
System.out.printf("Raw string: %s%n%n", raw);
//                           | escaped back slash
//                           | | escaped unicode point
//                           | |      | any 1+ character, reluctant
//                           | |      |  | escaped \ and unicode point
//                           | |      |  |        | group 1: your goal
//                           | |      |  |        |    | escaped final \ + unicode point
Pattern p = Pattern.compile("\\\\u0013.+?\\\\u0014(.+?)\\\\u0015");
Matcher m = p.matcher(raw);
while (m.find()) {
    System.out.printf("Found: %s%n", m.group(1));
}
System.out.println();

// actual replacement
System.out.printf(
    "Replaced: %s%n", 
    raw.replaceAll("\\\\u0013.+?\\\\u0014(.+?)\\\\u0015", "$1")
);

Output (line breaks added manually for clarity)

Raw string: The most popular fruits were apples and bananas (see section 
‘\u0013 HYPERLINK \l "_Common_fruit_types\" \u0001\u0014Common fruits\u0015’ 
and subsection ‘\u0013 HYPERLINK \l\"_Botanic_description\" 
\u0001\u0014Detailed botanic descriptions\u0015’ below).

Found: Common fruits
Found: Detailed botanic descriptions

Replaced: The most popular fruits were apples and bananas 
(see section ‘Common fruits’ and subsection ‘Detailed botanic descriptions’ below).
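The pattern above matches the six literal characters backslash, u, 0, 0, 1, 3 as they appear in the escaped test string. If the text coming out of Apache POI contains the real control characters instead (as the question's while loop suggests), a variant along the following lines may be closer to what is needed. This is a sketch under that assumption, with U+0013 starting a field, U+0014 separating the field instruction from the displayed text, and U+0015 ending the field. The class and method names are hypothetical, and nested fields are not handled.

```java
public class FieldCodeStripper {
    // Assumed markers: U+0013 begins a field, U+0014 separates the field
    // instruction from its displayed text, U+0015 ends the field.
    // Group 1 captures the displayed text between separator and end marker.
    private static final String FIELD =
            "\u0013[^\u0014\u0015]*\u0014([^\u0015]*)\u0015";

    // Replaces each complete field with its displayed text only.
    static String stripFields(String text) {
        return text.replaceAll(FIELD, "$1");
    }

    public static void main(String[] args) {
        String text = "(see section '\u0013 HYPERLINK \\l \"_Common_fruit_types\" "
                + "\u0001\u0014Common fruits\u0015' below)";
        System.out.println(stripFields(text));
        // prints: (see section 'Common fruits' below)
    }
}
```

Using negated character classes instead of reluctant .+? keeps each match from running across a field boundary and also lets fields with empty displayed text match.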

Solution 1
1 accepted 2015-08-26 12:04:24
