Find and replace Heading tags with regex in Notepad++

I have an OCR-scanned book and a tool that converts the OCR'd PDF to XML, but most of the XML tags come out wrong, so there is a second tool to fix them. Before using it, I need to put each heading tag from <h1> to <h5>, and each numbered item such as 1., 1.1. and 1.1.1., on its own line so it's easy to re-tag with that tool.

The XML code looks like this:

<h1>text</h1><h2>text</h2><h3>text</h3>

and

1.text.2.text.3.text.1.1.text.1.1.1.text 

I need to break the lines like this using a regex in Notepad++:

<h1>text</h1>
<h2>text</h2>
<h3>text</h3>

and

1.text.
2.text.
3.text.

and

1.1.text.
1.1.1.text.

I searched for </h1>\s* and replaced it with </h1>\n, but that only breaks after h1 tags. I need to break at all "h" tags and at the 1., 2., 1.1., 1.1.1. numbers too.

At the risk of getting downvoted, I think you may be better served by a parser. In the past, when I've had to handle similar tasks, I wrote a small script/program to parse the file and rewrite it as needed. Parsing the XML first and then reformatting with regex might be an easier way to reach your goal.

You can use this search and replace (provided your h1, h2, ... tags don't contain other tags):

search:  (?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)
replace: \n$1

Note: if you need Windows newlines, use \r\n instead of \n in the replacement.

pattern details:

(?<!^)                    # not at the beginning of the string

(                         # open the capture group 1
    <h[1-6][^<]*          # <h, a digit from 1 to 6, then all characters until
                          # the next < (to skip all the content between
                          # h1, h2... tags)
  |                       # OR
    (?<![0-9]\.)[0-9]+\.  # one or more digits and a dot, not preceded by a digit
                          # and a dot
)                         # close the capture group 1

$1 is a reference to the content of the capture group 1
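
If you'd rather check the pattern outside Notepad++ first, here is a minimal Python sketch that applies the same search and replace with the standard re module. The function name break_lines and its newline parameter are made up for illustration; re.MULTILINE is an assumption on my part, added so that ^ also matches at line starts and text that is already on its own line is left alone.

```python
import re

# Same pattern as the Notepad++ search above. re.MULTILINE (an assumption)
# makes ^ match at the start of every line, not only at the start of the text.
PATTERN = re.compile(
    r"(?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)",
    re.MULTILINE,
)


def break_lines(text: str, newline: str = "\n") -> str:
    """Insert `newline` before each heading tag or top-of-number match."""
    # A function is used as the replacement so that `newline` is inserted
    # literally, without any template escape processing.
    return PATTERN.sub(lambda m: newline + m.group(1), text)


if __name__ == "__main__":
    print(break_lines("<h1>text</h1><h2>text</h2><h3>text</h3>"))
    print(break_lines("1.text.2.text.3.text.1.1.text.1.1.1.text"))
```

Pass newline="\r\n" to get Windows line endings. As with the Notepad++ version, this assumes the heading tags don't contain other tags.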
