Find and replace heading tags with a regular expression in Notepad++

Question

I have an OCR-scanned book and a tool that converts the OCR'd PDF to XML, but most of the XML tags come out wrong, so there is a second tool to fix them. Before re-tagging, I need to put each <h1> to <h5> element, and each 1., 1.1., and 1.1.1. numbered item, on its own line so that it is easy to re-tag with that tool.

The XML code looks like this:

<h1>text</h1><h2>text</h2><h3>text</h3>

and

1.text.2.text.3.text.1.1.text.1.1.1.text

I need to break the lines like this using a regex in Notepad++:

<h1>text</h1>
<h2>text</h2>
<h3>text</h3>

and

1.text.
2.text.
3.text.

and

1.1.text.
1.1.1.text.

I used </h1>\s* as the search pattern and </h1>\n as the replacement, but that only breaks the h1 tags. I need to break all the "h" tags and the 1., 2., 1.1., 1.1.1. numbering too.

Answer 1

At the risk of getting downvoted, I think you may be better served by a parser. In the past, when I've had to handle similar tasks, I wrote a small script or program to parse the file and rewrite it as needed. Parsing the XML first and then reformatting with a regex may be an easier way to reach your goal.
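As a rough illustration of the script approach, here is a minimal Python sketch. It is not a real XML parser: it assumes the headings are flat (not nested), and it deliberately tolerates a closing tag whose level does not match the opening one, since the tags are expected to be re-tagged later. The names `HEADING`, `parse_headings`, and `one_heading_per_line` are made up for this example.

```python
import re

# Hypothetical pattern: one flat heading element. The closing tag may carry
# a different level than the opening tag (e.g. <h2>...</h3>); any whitespace
# after the element is consumed so that it can be normalised to one newline.
HEADING = re.compile(r"(<h([1-6])[^>]*>(.*?)</h[1-6]>)\s*", re.DOTALL)


def parse_headings(text):
    """Return (level, content) pairs in document order."""
    return [(int(m.group(2)), m.group(3)) for m in HEADING.finditer(text)]


def one_heading_per_line(text):
    """Put a single newline after each heading element.

    The element itself is left untouched, so wrong tags stay wrong and can
    still be fixed by the re-tagging tool afterwards.
    """
    return HEADING.sub(lambda m: m.group(1) + "\n", text)
```

`parse_headings` gives a structured view of what was found, which can be handy for sanity checks before rewriting; `one_heading_per_line` does the actual line breaking. The numbered items (1., 1.1., ...) are not handled here.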

Answer 2

You can use this search and replace (provided your h1, h2, ... tags don't contain other tags):

search:  (?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)
replace: \n$1

Note: if you need Windows newlines, replace \n with \r\n.

Pattern details:

(?<!^)                    # not preceded by the beginning of the string

(                         # open capture group 1
    <h[1-6][^<]*          # <h, a digit from 1 to 6, then all characters until
                          # the next < (to skip the content between
                          # the h1, h2... tags)
  |                       # OR
    (?<![0-9]\.)[0-9]+\.  # one or more digits and a dot, not preceded by a digit
                          # and a dot
)                         # close capture group 1

$1 is a reference to the content of capture group 1.
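To try the pattern outside Notepad++, it can be run through Python's `re` module, which accepts these lookbehinds as written. This is only a sketch under assumptions: `HEADING_BREAK` and `break_lines` are illustrative names, and `re.MULTILINE` is set so that `^` matches at the start of every line, which approximates how Notepad++ treats `^`.

```python
import re

# Same pattern as in the search box above, unchanged.
HEADING_BREAK = re.compile(
    r"(?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)",
    re.MULTILINE,  # assumption: mimic Notepad++, where ^ matches at each line start
)


def break_lines(text, newline="\n"):
    """Insert `newline` before every match of capture group 1.

    Pass newline="\r\n" for Windows line endings.
    """
    # A function replacement avoids any escape handling in the template.
    return HEADING_BREAK.sub(lambda m: newline + m.group(1), text)
```

Applied to the two sample lines from the question, `break_lines` puts each heading element and each top-level numbered item on its own line, while 1.1. and 1.1.1. stay intact because their inner numbers are preceded by a digit and a dot.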


Answer 1
1 2014-06-01 17:01:18

Answer 2
0 Accepted 2014-06-01 18:35:43
