Find and replace heading tags with a regular expression in Notepad++

Question

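I have an OCR-scanned book and a tool that converts the OCR-ed PDF into XML, but most of the XML tags are wrong, so I use another tool to fix them. Before that, I need to put each <h1> to <h5> tag, and each 1., 1.1. and 1.1.1. label, on its own line, so that re-tagging with the tool is easy.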

The XML looks like this:

<h1>text</h1><h2>text</h2><h3>text</h3>

and

1.text.2.text.3.text.1.1.text.1.1.1.text

I need a regular expression in Notepad++ that breaks the lines like this:

<h1>text</h1>
<h2>text</h2>
<h3>text</h3>

and

1.text.
2.text.
3.text.

and

1.1.text.
1.1.1.text.

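I used </h1>\s* as the search and </h1>\n as the replacement, but that only breaks the line after h1 tags. I need to break at all the "h" tags and at the 1., 2., 1.1. and 1.1.1. labels.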

Answer 1

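At the risk of being downvoted, I think a parser might serve you better. When I have had to handle similar tasks in the past, I wrote a small script or program to parse the file and rewrite it as needed. Parsing the XML first and then reformatting with regular expressions may be an easier way to reach your goal.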
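As a rough illustration of the parser route, here is a minimal Python sketch. It assumes the lenient `html.parser.HTMLParser` from the standard library is acceptable for this markup, and the names `HeadingSplitter` and `split_headings` are hypothetical, chosen only for this example. It keeps heading elements only, drops attributes and any text outside headings, and closes each heading with the tag it was opened with, so a mismatched close such as `</h3>` after `<h2>` is repaired along the way.

```python
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Collect each <h1>..<h6> element and re-emit it on its own line."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.lines = []      # finished heading elements, one per entry
        self._tag = None     # name of the heading currently open, if any
        self._text = []      # text collected inside the open heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Any heading close tag ends the open heading; the element is
        # written with the opening tag's name, which fixes mismatches.
        if self._tag is not None and tag in self.HEADINGS:
            text = "".join(self._text)
            self.lines.append(f"<{self._tag}>{text}</{self._tag}>")
            self._tag = None


def split_headings(markup):
    """Return the heading elements of `markup`, one per line."""
    parser = HeadingSplitter()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.lines)
```

Calling `split_headings` on a one-line string of headings returns the same headings separated by newlines. Whether discarding non-heading content is acceptable depends on the real file, so treat this as a starting point rather than a finished tool.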

Answer 2

You can use this search and replace:

search:  (?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)
replace: \n$1

Note: if you need Windows line endings, replace \n with \r\n.

Pattern details:

(?<!^)   # not preceded by the beginning of the string

(                         # open capture group 1
    <h[1-6][^<]*          # <h, a digit from 1 to 6, then all characters up to
                          # the next < (to skip the content between
                          # h1, h2... tags)
  |                       # OR
    (?<![0-9]\.)[0-9]+\.  # one or more digits and a dot, not preceded by a digit
                          # and a dot
)                         # close capture group 1

$1 is a reference to the contents of capture group 1.
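To check the pattern outside Notepad++, the same search and replace can be tried in Python. This is only a sketch: Python's `re` module is not the engine Notepad++ uses, so results on other inputs may differ, and the function name `break_lines` and its `newline` parameter are made up for this example. The pattern itself is copied unchanged from the answer.

```python
import re

# Same pattern as in the answer: a heading open tag plus its text, or a
# number label that is not the continuation of a dotted number.
PATTERN = re.compile(r"(?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)")


def break_lines(text, newline="\n"):
    """Insert `newline` before every match of PATTERN in `text`."""
    # A function replacement avoids escape handling in the template,
    # so newline="\r\n" works as-is for Windows line endings.
    return PATTERN.sub(lambda m: newline + m.group(1), text)


if __name__ == "__main__":
    print(break_lines("<h1>text</h1><h2>text</h2><h3>text</h3>"))
    print(break_lines("1.text.2.text.3.text.1.1.text.1.1.1.text"))
```

Passing `newline="\r\n"` corresponds to the Windows variant mentioned in the note above.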

Answer 1
1 2014-06-01 17:01:18

Answer 2
0 Accepted 2014-06-01 18:35:43
