Find and replace heading tags with regex in Notepad++

There's an OCR-scanned book, and a tool that converts the OCR'd PDF to XML, but most of the XML tags come out wrong, so there's another tool to fix them. Before I can use it, I need to break the lines at each heading tag from <h1> to <h5>, and at the numbered items 1., 1.1. and 1.1.1., so it's easy to re-tag using the tool.

The XML code looks like this:

<h1>text</h1><h2>text</h2><h3>text</h3>

and

1.text.2.text.3.text.1.1.text.1.1.1.text 

And I need to break the lines like this using a regex in Notepad++:

<h1>text</h1>
<h2>text</h2>
<h3>text</h3>

and

1.text.
2.text.
3.text.

and

1.1.text.
1.1.1.text.

I used </h1>\s* as the search and </h1>\n as the replacement, but that only breaks after h1 tags. I need to break at all the "h" tags, and at the 1., 2., 1.1. and 1.1.1. items too.

At the risk of getting downvoted, I think you may be better served by a parser. In the past, when I've had to manage similar tasks, I would write a small script/program to parse the file and rewrite it as needed. Parsing the XML first and then reformatting with a regex might be an easier way to accomplish your goal.
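As a rough illustration of that idea, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is only one possible reading of "parse the file and rewrite it", and it assumes the fragment is well-formed and has no XML declaration, which may not hold for raw OCR output; a mismatched tag makes the parser raise ParseError. The function name break_after_headings and the wrapper element name "root" are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Tag names treated as headings (h1 .. h6).
HEADING = re.compile(r"h[1-6]")

def break_after_headings(fragment: str) -> str:
    """Return the fragment with a newline after every closing heading tag.

    Assumption: `fragment` is well-formed once wrapped in a single root
    element. ET.fromstring raises ET.ParseError otherwise.
    """
    # Wrap in a hypothetical <root> element so sibling headings can be parsed.
    root = ET.fromstring(f"<root>{fragment}</root>")
    for elem in root.iter():
        if HEADING.fullmatch(elem.tag):
            # The "tail" is the text that follows the element's closing tag.
            elem.tail = "\n" + (elem.tail or "")
    # Serialize the children only, so the wrapper element is not emitted.
    parts = [ET.tostring(child, encoding="unicode") for child in root]
    return (root.text or "") + "".join(parts)

print(break_after_headings("<h1>text</h1><h2>text</h2><h3>text</h3>"), end="")
```

Since a strict parser rejects broken markup, this only helps once the tags are at least balanced; for the numbered items (1., 1.1., ...) a regex pass over the text is still needed.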

You can use this search and replace (as long as your h1, h2, ... tags don't contain other tags):

search:  (?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)
replace: \n$1

Note: if you need Windows newlines, replace \n with \r\n.

Pattern details:

(?<!^)   # not preceded by the beginning of the string

(                         # open the capture group 1
    <h[1-6][^<]*          # <h, a digit from 1 to 6, all characters until
                          # the next < (to skip all the content between
                          # h1, h2... tags)
  |                       # OR
    (?<![0-9]\.)[0-9]+\.  # one or more digits and a dot not preceded by a digit
                          # and a dot 
)                         # close the capture group 1

$1 is a reference to the content of capture group 1.
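If you want to check the pattern outside Notepad++, the same search and replace can be sketched in Python with re.sub. The function name break_lines is made up for this example, and Python's re module is not the regex engine Notepad++ uses, so treat this as an approximation of the behaviour rather than an exact reproduction. In particular, without re.MULTILINE the ^ here only matches at the very start of the string.

```python
import re

# Same pattern as the Notepad++ search above.
PATTERN = re.compile(r"(?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)")

def break_lines(text: str, newline: str = "\n") -> str:
    """Insert `newline` before each heading tag or top-level number.

    Pass newline="\r\n" for Windows line endings.
    """
    # A function replacement avoids any escaping issues in the template.
    return PATTERN.sub(lambda m: newline + m.group(1), text)

print(break_lines("<h1>text</h1><h2>text</h2><h3>text</h3>"))
print(break_lines("1.text.2.text.3.text.1.1.text.1.1.1.text"))
```

On the two samples from the question this yields one heading per line, and one numbered item per line with 1.1. and 1.1.1. kept intact.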
